Abstract

We study distributed optimization problems when $N$ nodes minimize the sum $\sum_{i=1}^N f_i(x)$ of their
individual costs subject to a common variable $x$. The $f_i$'s are convex, have Lipschitz continuous gradient
(with constant $L$), and bounded gradient. We propose two fast distributed gradient algorithms based
on the centralized Nesterov gradient and establish their convergence rates in terms of the per-node
communications $\mathcal{K}$, the per-node gradient evaluations $k$, and the network topology. Our first method,
Distributed Nesterov Gradient, achieves rate $O\left(\frac{1}{(1-\mu(N))^{3+\xi}}\frac{\log \mathcal{K}}{\mathcal{K}}\right)$ and
$O\left(\frac{1}{(1-\mu(N))^{3+\xi}}\frac{\log k}{k}\right)$ ($\xi > 0$
arbitrarily small) when the nodes lack knowledge of the global parameters, the Lipschitz constant $L$
and the spectral gap $1 - \mu(N)$. When the nodes know $L$ and $\mu(N)$, the rate is
$O\left(\frac{1}{(1-\mu(N))^{1+\xi}}\frac{\log \mathcal{K}}{\mathcal{K}}\right)$ and
$O\left(\frac{1}{(1-\mu(N))^{1+\xi}}\frac{\log k}{k}\right)$, with optimized step size. Our second method, Distributed Nesterov gradient
with consensus iterations, assumes $L$ and $\mu(N)$ known by all. It achieves rate
$O\left(\frac{1}{(1-\mu(N))^{2}}\frac{1}{\mathcal{K}^{2-\xi}}\right)$
and $O(1/k^2)$. While involving only computationally simple iterations, the methods we propose have
strictly faster rates than existing distributed (sub)gradient methods, which have rates at most $\Omega(1/\mathcal{K}^{2/3})$
and $\Omega(1/k^{2/3})$. Simulation examples with the logistic and Huber losses demonstrate that our algorithms
outperform existing distributed algorithms.

Keywords: Distributed optimization, convergence rate, Nesterov gradient, consensus. 

I. Introduction 

Cooperative convex optimization over networks has received much attention recently, motivated
by applications in sensor [1], multi-robot [2], or cognitive networks [3], [4], as well as in
distributed learning [5]. This paper focuses on the problem where $N$ nodes (sensors, processors,
agents) cooperatively minimize a sum of convex functions $f(x) := \sum_{i=1}^N f_i(x)$ subject to a
common optimization variable $x \in \mathbb{R}^d$. Each function $f_i : \mathbb{R}^d \to \mathbb{R}$ is convex and known only
to node $i$. The underlying network is generic, sparse, and connected.

To solve this and related problems, the existing literature proposes at least three distributed 
(sub)gradient type algorithms: a distributed (sub)gradient method in [6] that is further analyzed 
in [7], [8], [9], [10], [11], [12], [13]; distributed dual averaging in [14], analyzed in [1], [15];
and a primal-dual (sub)gradient method in [16]. 

When the nodes lack global knowledge of the network parameters, reference [14] establishes,
for the distributed dual averaging algorithm, rate $O\left(\frac{1}{1-\mu(N)}\frac{\log k}{k^{1/2}}\right)$, where $k$ is the number
of communicated $d$-dimensional vectors per node, which also equals the number of iterations
(gradient evaluations per node), and $1-\mu(N)$ is the spectral gap.^1 Further, when $\mu(N)$ is known to
the nodes, and after optimizing the step-size, [14] shows convergence rate of $O\left(\frac{1}{(1-\mu(N))^{1/2}}\frac{\log k}{k^{1/2}}\right)$.

Contributions. The class of functions considered in these references is more general than the class
we consider here; namely, they assume that the $f_i$'s are (possibly) non-differentiable and convex,
and: 1) for unconstrained minimization, the $f_i$'s have bounded gradients, while 2) for constrained
minimization, they are Lipschitz continuous over the constraint set. In contrast, we assume the
class $\mathcal{F}$ of convex $f_i$'s with Lipschitz continuous and bounded gradients. It is well established in
centralized optimization [17] that one should expect faster convergence rates on classes of more
structured functions. For example, for convex, non-smooth functions, the best achievable rate
for (centralized) (sub)gradient methods is $O(1/\sqrt{k})$, while, for convex functions with Lipschitz
continuous gradient, the best rate is $O(1/k^2)$, achieved, e.g., by the (centralized) Nesterov
gradient [17]. (Here $k$ is the number of iterations, i.e., the number of gradient evaluations.) In
this paper, for the class $\mathcal{F}$, and building from the centralized Nesterov gradient, we develop two
distributed gradient methods and prove their convergence rates, in terms of the number of per-node
communications $\mathcal{K}$, the per-node gradient evaluations $k$, and the network topology. Our first
method, the Distributed Nesterov Gradient (D-NG), when the nodes have no global knowledge of
$L$, $G$, $\mu(N)$, $R$, achieves convergence rate $O\left(\frac{1}{(1-\mu(N))^{3+\xi}}\frac{\log \mathcal{K}}{\mathcal{K}}\right)$. Here, $L$ and $G$ are the Lipschitz
constant and the gradient bound, $1-\mu(N)$ is the spectral gap, $R$ is the distance to the solution, and
$\xi > 0$ is an arbitrarily small quantity. When $L$ and $\mu(N)$ are known to the nodes, the distributed
Nesterov gradient with optimized step-size achieves $O\left(\frac{1}{(1-\mu(N))^{1+\xi}}\frac{\log \mathcal{K}}{\mathcal{K}}\right)$. Our second method,
Distributed Nesterov gradient with Consensus iterations (D-NC), assumes global knowledge or at
least upper bounds on $\mu(N)$ and $L$. It achieves convergence rate $O\left(\frac{1}{(1-\mu(N))^{2}}\frac{1}{\mathcal{K}^{2-\xi}}\right)$ in the number
of communications per node $\mathcal{K}$, and $O\left(\frac{1}{k^2}\right)$ in the number of gradient evaluations. Further, we
establish that, for the class $\mathcal{F}$, both our methods (achieving at least $O(\log k/k)$) are strictly
better than the distributed (sub)gradient method [6] and the distributed dual averaging algorithm
in [14]. We show analytically that [6] cannot be better than $\Omega(1/k^{2/3})$ (see Subsection VII-B
for a precise statement), and by simulation that [6] and [14] perform very similarly.

^1 We denote by $\mu(N)$ the modulus of the second largest eigenvalue (in modulus) of the underlying weight matrix $W$. Note
that $\mu(N)$ depends on $W$, and hence on the network topology.

Our results imply that, ignoring "$\xi$-small" and logarithmic factors, the distributed Nesterov
gradient requires $\mathcal{K}(N;\epsilon) = O\left(\frac{1}{(1-\mu(N))\,\epsilon}\right)$ per-node communications to achieve $\epsilon$-accuracy,
while the distributed Nesterov gradient with consensus requires $O\left(\frac{1}{(1-\mu(N))\,\epsilon^{1/2}}\right)$. Interestingly,
the latter shows that $\mathcal{K}(N;\epsilon)$ scales (at most) linearly in the inverse of the spectral gap, i.e.,
$\mathcal{K}(N;\epsilon) \sim a(\epsilon)\frac{1}{1-\mu(N)}$, for both our algorithms, and the algorithm in [14]. Simulations show
very good match with the theoretical linear scaling predictions. The simulations also show that,
although the distributed Nesterov gradient with consensus has the best asymptotic slope $a(\epsilon)$
as $\epsilon \to 0$, for practical accuracies $\epsilon$, the distributed Nesterov gradient is the best among all
algorithms.

We remark that, in addition to distributed gradient based methods, different types of distributed 
methods have also been proposed, namely distributed augmented Lagrangian dual or ordinary 
dual methods [4], [18], [19], [20], [21], [22]. These are based on the augmented Lagrangian 
(or ordinary) dual of the original problem. These methods in general have significantly more 
complex iterations than the gradient type methods, due to solving local optimization problems 
at each node, at each iteration, but may have a lower total communication cost. Finally, we 
reference [19] that has applied Nesterov gradient method to propose an augmented Lagrangian 
dual algorithm, but that does not study its convergence rate. In contrast, ours are primal gradient 
algorithms, with no notion of Lagrangian dual variables, and we prove their convergence rates. 
Paper organization. The next paragraph introduces notation. Section II describes the network
and optimization models. Section III presents our algorithms, distributed Nesterov gradient and 
distributed Nesterov gradient with consensus iterations, D-NG and D-NC for short. Section IV 
explains the framework of Nesterov gradient under inexact oracle; we use this framework to 
establish the convergence rate results for D-NG and D-NC. Sections V and VI prove convergence 
rate results for algorithms D-NG and D-NC, respectively. Section VII compares our algorithms 
D-NG and D-NC with existing distributed gradient type methods and discusses the algorithms' 
implementation. Section VIII provides simulation examples. Finally, we conclude in Section IX. 
Notation. We deal with both real and complex scalars, vectors, and matrices, and the notation we
define here refers to both, unless explicitly stated otherwise. We denote by $\mathbb{R}^d$ the $d$-dimensional
real coordinate space. We index by a subscript $i$ a (possibly) vector quantity assigned to node
$i$; e.g., $x_i(k)$ is node $i$'s estimate at iteration $k$. Further, we denote by: $j$ the imaginary unit
($j^2 = -1$); $A_{lm}$ or $[A]_{lm}$ the entry in the $l$-th row and $m$-th column of a matrix $A$; $a^{(l)}$ the
$l$-th entry of vector $a$; $A^\top$ the transpose of a real matrix $A$; $A^H$ the conjugate transpose of $A$; $I$,
$0$, $\mathbf{1}$, and $e_i$, respectively, the identity matrix, the zero matrix, the column vector with unit entries,
and the $i$-th column of $I$; $J$ the $N \times N$ ideal consensus matrix $J := (1/N)\mathbf{1}\mathbf{1}^\top$; $\oplus$ and $\otimes$ the
direct sum and Kronecker product of matrices, respectively; $\|\cdot\|_l$ the vector (respectively, matrix)
$l$-norm of its vector (respectively, matrix) argument; $\|\cdot\| = \|\cdot\|_2$ the Euclidean (respectively,
spectral) norm of its vector (respectively, matrix) argument ($\|\cdot\|$ also denotes the modulus of a
scalar); $\lambda_i(\cdot)$ the $i$-th smallest in modulus eigenvalue; $A \succ 0$ means that the Hermitian matrix
$A$ is positive definite; $\lfloor a \rfloor$ the integer part of a real scalar $a$; $\nabla\phi(x)$ and $\nabla^2\phi(x)$ the gradient
and Hessian at $x$ of a twice differentiable function $\phi : \mathbb{R}^d \to \mathbb{R}$, $d \geq 1$; $\zeta(s) = \sum_{t=1}^{\infty} \frac{1}{t^s}$ the
Riemann zeta function; $h_k = \sum_{t=1}^{k} \frac{1}{t}$ the $k$-th harmonic number; and $\gamma = 0.577215...$ the
Euler-Mascheroni constant. For two positive sequences $\eta_n$ and $\chi_n$, $\eta_n = O(\chi_n)$ means that
$\limsup_{n\to\infty} \frac{\eta_n}{\chi_n} < \infty$; $\eta_n = \Omega(\chi_n)$ means that $\liminf_{n\to\infty} \frac{\eta_n}{\chi_n} > 0$; and $\eta_n = \Theta(\chi_n)$ means that
$\eta_n = O(\chi_n)$ and $\eta_n = \Omega(\chi_n)$.

II. Problem model 

This section introduces the network and optimization models that we assume. 

Network model. We consider a (sparse) network $\mathcal{N}$ of $N$ nodes (sensors, processors, agents),
each communicating only with a subset of the remaining nodes. The communication pattern is
captured by the graph $\mathcal{G} = (\mathcal{N}, E)$, where $E \subset \mathcal{N} \times \mathcal{N}$ is the set of links.

Assumption 1 (Network) Graph $\mathcal{G}$ is connected, undirected, and simple (no self/multiple links).

Weight matrix. We associate to graph $\mathcal{G}$ a symmetric, stochastic (rows sum to one and all
the entries are non-negative), $N \times N$ weight matrix $W$, with, for $i \neq j$, $W_{ij} > 0$ if and only
if $\{i,j\} \in E$, and $W_{ii} = 1 - \sum_{j \neq i} W_{ij}$. Denote by $\widetilde{W} = W - J$. We require, with the D-NG
algorithm, that:

\[ \mu(N) := \|\widetilde{W}\| < 1 \quad \text{and} \quad W \succ 0, \qquad (1) \]

while, with D-NC, we require only the first condition in (1). The first condition in (1) is standard;
the second condition, of $W$ being positive definite, is what we additionally require with respect
to existing work, e.g., [6], [14]. However, we emphasize that, for both conditions in (1) to
hold, nodes require no global knowledge about the network parameters, e.g., they do not need
to know $N$. They can fulfill (1) if each node $i$ knows its own degree $d_i$ and the degrees of
its immediate neighbors. A possible choice is $W_{ij} = 1/(1 + 3\max\{d_i, d_j\})$ for $\{i,j\} \in E$;
$W_{ij} = 0$ for $i \neq j$ and $\{i,j\} \notin E$; and $W_{ii} = 1 - \sum_{j \neq i} W_{ij}$. We order the eigenvalues of $\widetilde{W}$
with $\lambda_N(\widetilde{W}) = \mu(N)$ being largest; we have $\lambda_i(\widetilde{W}) \in (0,1)$, $i > 1$, and $\lambda_1(\widetilde{W}) = 0$, with the
eigenvector $q_1 = \frac{1}{\sqrt{N}}\mathbf{1}$. We let $\widetilde{W} = Q \Lambda Q^\top$, where $\Lambda$ is the diagonal matrix with $\Lambda_{ii} = \lambda_i(\widetilde{W})$,
and $Q = [q_1, ..., q_N]$ is the matrix of the eigenvectors of $\widetilde{W}$.
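
The degree-based weight choice above can be illustrated with a short sketch. The helper names `degree_weights` and `is_strictly_diagonally_dominant`, and the small example graph, are hypothetical and introduced only for illustration. The sketch relies on a standard fact: a symmetric matrix with positive diagonal that is strictly diagonally dominant is positive definite (Gershgorin), which is one way to see why this choice gives $W \succ 0$. It does not check connectivity, which is what the first condition in (1) needs.

```python
def degree_weights(n, edges):
    """Build W with W_ij = 1/(1 + 3*max(d_i, d_j)) on links and
    W_ii = 1 - sum_{j != i} W_ij; only local degree information is used."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        w = 1.0 / (1.0 + 3.0 * max(deg[i], deg[j]))
        W[i][j] = w
        W[j][i] = w
    for i in range(n):
        W[i][i] = 1.0 - sum(W[i][j] for j in range(n) if j != i)
    return W


def is_strictly_diagonally_dominant(W):
    """True if every diagonal entry exceeds the sum of the other row entries."""
    n = len(W)
    return all(
        W[i][i] > sum(abs(W[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )


# Hypothetical 4-node connected graph: a cycle with one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
W = degree_weights(4, edges)
```

Each off-diagonal row sum is at most $d_i/(1+3d_i) < 1/3$, so every diagonal entry exceeds $2/3$ and the dominance test passes for any simple graph.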

Distributed optimization model. The nodes solve the unconstrained problem:

\[ \text{minimize} \;\; \sum_{i=1}^{N} f_i(x) =: f(x). \qquad (2) \]

The function $f_i : \mathbb{R}^d \to \mathbb{R}$ is known only to node $i$. We impose the following assumptions.

Assumption 2 (Solvability and Lipschitz continuity of the derivative) (a) Problem (2) is solvable.
(b) For all $i$, $f_i$ is convex and has Lipschitz continuous derivative with constant $L \in [0, \infty)$:

\[ \|\nabla f_i(x) - \nabla f_i(y)\| \leq L \|x - y\|, \quad \forall x, y \in \mathbb{R}^d. \]

We denote by $x^\star$ a solution to (2) and by $f^\star := f(x^\star)$ the optimal value.

Assumption 3 (Bounded gradients) $\exists G \in [0, \infty)$ such that, $\forall i$, $\|\nabla f_i(x)\| \leq G$, $\forall x \in \mathbb{R}^d$.

Examples of the $f_i$'s that satisfy Assumptions 2-3 include the logistic and Huber losses (see
Section VIII), or the "fair" function in robust statistics, $\phi : \mathbb{R} \to \mathbb{R}$, $\phi(x) = b_0^2 \left( \frac{|x|}{b_0} - \log\left(1 + \frac{|x|}{b_0}\right) \right)$,
where $b_0$ is a positive parameter, e.g., [23].
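
As a rough numerical illustration of Assumptions 2-3, the sketch below estimates the gradient bound $G$ and the Lipschitz constant $L$ on a grid for scalar Huber and logistic losses. The function names, the Huber threshold `delta`, and the logistic parameters `a` and `b` are hypothetical choices made for this example; the paper's own loss parametrization (Section VIII) may differ.

```python
import math


def huber_grad(x, delta=1.0):
    """Derivative of the Huber loss: x for |x| <= delta, delta*sign(x) otherwise."""
    return max(-delta, min(delta, x))


def logistic_grad(x, a=2.0, b=1.0):
    """Derivative in x of log(1 + exp(-b*a*x)), with a label b in {-1, +1}."""
    return -b * a / (1.0 + math.exp(b * a * x))


def empirical_constants(grad, grid):
    """Largest |gradient| and largest gradient difference quotient on the grid."""
    g = [grad(x) for x in grid]
    G = max(abs(v) for v in g)
    L = max(
        abs(g[i + 1] - g[i]) / (grid[i + 1] - grid[i])
        for i in range(len(grid) - 1)
    )
    return G, L


grid = [-10.0 + 0.01 * i for i in range(2001)]
G_huber, L_huber = empirical_constants(huber_grad, grid)
G_logistic, L_logistic = empirical_constants(logistic_grad, grid)
```

For these parameters the Huber loss has $G = \delta = 1$ and $L = 1$, while the logistic loss has $G \leq |a| = 2$ and $L \leq a^2/4 = 1$; the grid estimates should not exceed these values.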

III. Distributed Nesterov based algorithms 
Subsection III-A presents algorithm D-NG, while subsection III-B presents algorithm D-NC.

A. Algorithm D-NG

Algorithm D-NG generates the sequence $(x_i(k), y_i(k))$, $k = 0, 1, 2, ...$, at each node $i$, where
$y_i(k)$ is an auxiliary variable. Given the initialization $x_i(0) = y_i(0)$, for all $i$, the update for
$k = 1, 2, ...$, at node $i$ of algorithm D-NG is:

\[ x_i(k) = \sum_{j \in O_i} W_{ij}\, y_j(k-1) - \alpha_{k-1} \nabla f_i(y_i(k-1)) \qquad (3) \]
\[ y_i(k) = x_i(k) + \beta_{k-1} \left( x_i(k) - x_i(k-1) \right). \qquad (4) \]

Here, $W_{ij}$ are the averaging weights (the entries of $W$), and $O_i$ is the neighborhood set of node
$i$ (including $i$). The step-size $\alpha_k$ and the sequence $\beta_k$ are:

\[ \alpha_k = \frac{c}{k+1}, \;\; c > 0; \qquad \beta_k = \frac{k}{k+3}, \;\; k = 0, 1, ... \qquad (5) \]

With algorithm (3)-(4), each node $i$, at each iteration $k$: 1) broadcasts its variable $y_i(k-1)$ to
all its neighbors $j \in O_i$; 2) receives $y_j(k-1)$ from all its neighbors $j \in O_i$; 3) updates $x_i(k)$
by weight-averaging its own $y_i(k-1)$ and its neighbors' variables $y_j(k-1)$, and performs a
negative gradient step with respect to $f_i$; and 4) updates $y_i(k)$ via the inexpensive update in (4).
To avoid notation explosion in the analysis further ahead, we assume throughout equal initial
estimates $x_i(0) = y_i(0) = x_j(0) = y_j(0)$ for all $i, j$; e.g., nodes can set them to zero.

We adopt the sequence $\beta_k$ as proposed in the centralized fast gradient method by Nesterov [17];
see also [24]. With the centralized Nesterov gradient, $\alpha_k = \alpha$ is constant along the iterations.
However, algorithm (7)-(8) under a constant step-size does not converge to the exact solution,
but only to a solution neighborhood. More precisely, in general, $f(x_i(k))$ does not converge
to $f^\star$ (see [25] for details). We force $f(x_i(k))$ to converge to $f^\star$ with (7)-(8) by adopting a
diminishing step-size $\alpha_k$, as in (5). The constant $c > 0$ in (5) can be arbitrary (see also
Theorem 5 ahead and the remark below it).
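
The four per-iteration steps of D-NG can be sketched for scalar variables ($d = 1$) as follows. This is a minimal simulation under stated assumptions, not the authors' implementation: the ring topology, the Huber-type local costs $f_i(x) = \mathrm{huber}(x - a_i)$, the values in `targets`, and the names `ring_weights`, `huber_grad`, and `dng` are all hypothetical. The step-size is $\alpha_k = c/(k+1)$ and the momentum is taken as $\beta_k = k/(k+3)$, the Nesterov sequence consistent with $\theta_k = 2/(k+2)$ used in Section IV.

```python
def huber_grad(x, a):
    """Gradient of the Huber loss huber(x - a) with unit threshold (G = 1, L = 1)."""
    return max(-1.0, min(1.0, x - a))


def ring_weights(n):
    """Ring graph (every degree is 2) with W_ij = 1/(1 + 3*max(d_i, d_j)) = 1/7."""
    w = 1.0 / (1.0 + 3.0 * 2.0)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][(i - 1) % n] = w
        W[i][(i + 1) % n] = w
        W[i][i] = 1.0 - 2.0 * w
    return W


def dng(W, targets, c=0.5, iters=5000, x0=3.0):
    """Run updates (3)-(4) synchronously at all nodes; return the final x_i's."""
    n = len(targets)
    x = [x0] * n          # equal initial estimates x_i(0) = y_i(0)
    y = [x0] * n
    for k in range(1, iters + 1):
        alpha = c / k                 # alpha_{k-1} = c / ((k-1) + 1)
        beta = (k - 1) / (k + 2)      # beta_{k-1} = (k-1) / ((k-1) + 3)
        # Steps 1-3: exchange y_j(k-1), weight-average, take a local gradient step.
        x_new = [
            sum(W[i][j] * y[j] for j in range(n))
            - alpha * huber_grad(y[i], targets[i])
            for i in range(n)
        ]
        # Step 4: momentum update of the auxiliary variable.
        y = [x_new[i] + beta * (x_new[i] - x[i]) for i in range(n)]
        x = x_new
    return x


x_final = dng(ring_weights(5), [-2.0, -1.0, 0.0, 1.0, 2.0])
```

With these symmetric targets the sum of the local costs is minimized at $x^\star = 0$, so the entries of `x_final` should be close to zero and close to each other. Each iteration uses one communication per node, so here $k = \mathcal{K}$.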

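Compact form. We re-write (3)-(4) in compact form. Let $x(k) = (x_1(k)^\top, x_2(k)^\top, ..., x_N(k)^\top)^\top$,
and $y(k) = (y_1(k)^\top, y_2(k)^\top, ..., y_N(k)^\top)^\top$, and introduce the map $F : \mathbb{R}^{Nd} \to \mathbb{R}^N$ as:

\[ F(x) = F(x_1, x_2, ..., x_N) = \left( f_1(x_1), f_2(x_2), ..., f_N(x_N) \right)^\top. \qquad (6) \]

Then, given initialization $x(0) = y(0)$, D-NG in compact form is:

\[ x(k) = (W \otimes I)\, y(k-1) - \alpha_{k-1} \nabla F(y(k-1)) \qquad (7) \]
\[ y(k) = x(k) + \beta_{k-1} \left( x(k) - x(k-1) \right), \quad k = 1, 2, ..., \qquad (8) \]

where $W \otimes I$ is the Kronecker product of $W$ and the $d \times d$ identity, and $\nabla F(y(k-1))$ stacks
the local gradients $\nabla f_i(y_i(k-1))$, $i = 1, ..., N$.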

B. Algorithm D-NC

Algorithm D-NC uses a constant step-size $\alpha$ and operates in two time scales. In the outer
(slow time scale) iterations $k$, each node $i$ updates its solution estimate $x_i(k)$ and an auxiliary
variable $y_i(k)$ (as with the D-NG);^2 in the inner iterations $s$, nodes perform three rounds of the
average-consensus algorithm, each with $\tau_k$ inner iterations ($\tau_k$ is specified further
ahead). We briefly recall the average-consensus algorithm. For a fixed $k$, given the initialization
$z_i(s = 0, k) \in \mathbb{R}^d$, $i = 1, ..., N$, node $i$'s update with average-consensus is:

\[ z_i(s, k) = \sum_{j \in O_i} W_{ij}\, z_j(s-1, k), \quad s = 1, 2, ... \qquad (9) \]
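
The average-consensus recursion (9) can be sketched in a few lines for scalar values. The ring topology and the names `ring_weights` and `consensus` are assumptions made for this illustration. For a symmetric stochastic $W$ with $\mu(N) < 1$, every $z_i$ approaches the average of the initial values at a geometric rate governed by $\mu(N)$.

```python
def ring_weights(n):
    """Ring graph (every degree is 2) with W_ij = 1/(1 + 3*max(d_i, d_j)) = 1/7."""
    w = 1.0 / (1.0 + 3.0 * 2.0)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][(i - 1) % n] = w
        W[i][(i + 1) % n] = w
        W[i][i] = 1.0 - 2.0 * w
    return W


def consensus(W, z, num_iters):
    """Apply z_i <- sum_j W_ij z_j for num_iters synchronous rounds."""
    n = len(z)
    for _ in range(num_iters):
        z = [sum(W[i][j] * z[j] for j in range(n)) for i in range(n)]
    return z


z0 = [4.0, 0.0, -2.0, 1.0, 7.0]       # initial values, average 2.0
z_few = consensus(ring_weights(5), z0, 5)
z_many = consensus(ring_weights(5), z0, 200)
```

After a few rounds the values are still spread around the average; after many rounds they agree with it to numerical precision. D-NC exploits exactly this trade-off by growing the number of rounds slowly with $k$.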

We now detail D-NC. Given the initialization $x_i(0) = y_i(0) \in \mathbb{R}^d$, at iteration $k$ (given $x_i(k-1)$
and $y_i(k-1)$) node $i$ first calculates $\nabla f_i(y_i(k-1))$. Then, nodes run $\tau_{k-1}$ iterations of (9) with
the initialization $z_i(s = 0, k-1) = \nabla f_i(y_i(k-1))$, so that each node obtains $g_i(k-1)$, an
inexact version of $\frac{1}{N}\sum_{i=1}^{N} \nabla f_i(y_i(k-1))$. Subsequently, node $i$ performs the update:

\[ x_i^{(a)}(k) = y_i(k-1) - \alpha\, g_i(k-1). \qquad (10) \]

Then, nodes jointly run the second average-consensus (9) with $\tau_{k-1}$ iterations, but now with the
initialization $z_i(s = 0, k-1) = x_i^{(a)}(k)$, to obtain $x_i(k)$, an inexact version of $\frac{1}{N}\sum_{i=1}^{N} x_i^{(a)}(k)$.
Subsequently, node $i$ calculates $y_i^{(a)}(k)$ via:

\[ y_i^{(a)}(k) = x_i(k) + \beta_{k-1} \left( x_i(k) - x_i(k-1) \right), \qquad (11) \]

where $\beta_k$ is in (5). Finally, nodes run the third (last) average-consensus (9) with the initialization
$z_i(s = 0, k-1) = y_i^{(a)}(k)$ and $\tau_{k-1}$ iterations, so that node $i$ obtains $y_i(k)$, an inexact version
of $\frac{1}{N}\sum_{i=1}^{N} y_i^{(a)}(k)$. Algorithm D-NC is summarized in Algorithm 1.

Step-size, initialization, and the number of inner (consensus) iterations $\tau_k$. We require the
step-size $\alpha$ to satisfy $\alpha \leq 1/(2L)$. With D-NC, this condition is critical for convergence. (See
also Subsection VIII-A and Figure 1, left.) Also, we let the number of the inner (consensus)
iterations $\tau_k$ at the outer iteration $k$ equal:

\[ \tau_k = \left\lceil \frac{\log 3}{-\log \mu(N)} + \frac{3 \log(k+1)}{-\log \mu(N)} \right\rceil. \qquad (12) \]

^2 To avoid notation explosion, we use the same letters to denote the iterates of D-NG and D-NC.

Algorithm 1 Algorithm D-NC

1: Initialization: Each node $i$ sets $x_i(0) = y_i(0) \in \mathbb{R}^d$ and $k = 1$.
2: Each node $i$ calculates $\nabla f_i(y_i(k-1))$.
3: (First consensus) Nodes run (9) for $s = 1, 2, ..., \tau_{k-1}$, with $z_i(s = 0, k-1) = \nabla f_i(y_i(k-1))$, so that node $i$
   obtains $g_i(k-1) := z_i(s = \tau_{k-1}, k-1)$.
4: Each node $i$ calculates $x_i^{(a)}(k)$ via (10).
5: (Second consensus) Nodes run (9) for $s = 1, 2, ..., \tau_{k-1}$, with $z_i(s = 0, k-1) = x_i^{(a)}(k)$, so that node $i$
   obtains $x_i(k) := z_i(s = \tau_{k-1}, k-1)$.
6: Each node $i$ calculates $y_i^{(a)}(k)$ via (11).
7: (Third consensus) Nodes run (9) for $s = 1, 2, ..., \tau_{k-1}$, with $z_i(s = 0, k-1) = y_i^{(a)}(k)$, so that node $i$
   obtains $y_i(k) := z_i(s = \tau_{k-1}, k-1)$.
8: Set $k \leftarrow k + 1$ and go to step 2.

Note that $\tau_k$ increases as a logarithm with $k$; also, it depends on the underlying network (through
the second largest eigenvalue $\mu(N)$ of the weight matrix). As with D-NG, we assume $x_i(0) =
y_i(0) = x_j(0) = y_j(0)$, for all $i, j$.

Compact form. We write D-NC in compact form. Use the same compact notation for $x(k)$,
$y(k)$, and $\nabla F(y(k))$ as with D-NG. Then:

\[ x(k) = (W \otimes I)^{\tau_{k-1}} \left[ y(k-1) - \alpha\, (W \otimes I)^{\tau_{k-1}} \nabla F(y(k-1)) \right] \qquad (13) \]
\[ y(k) = (W \otimes I)^{\tau_{k-1}} \left[ x(k) + \beta_{k-1} \left( x(k) - x(k-1) \right) \right]. \qquad (14) \]

Note that the right matrix power $(W \otimes I)^{\tau_{k-1}}$ in (13) corresponds to the first consensus on the
$\nabla f_i(y_i(k-1))$'s; the left matrix power $(W \otimes I)^{\tau_{k-1}}$ in (13) corresponds to the second consensus on
the $x_i^{(a)}(k)$'s; and the power $(W \otimes I)^{\tau_{k-1}}$ in (14) corresponds to the third consensus on the $y_i^{(a)}(k)$'s.

Performance metrics. With both algorithms D-NG and D-NC, we are interested in estimating
the optimality gap in the objective function at each node $i$: $\frac{1}{N}\left( f(x_i) - f^\star \right)$, where $x_i$ is node
$i$'s estimate of the solution at a certain stage of the algorithm operation. We are interested in how
node $i$'s optimality gap decreases with: 1) the number of (outer) iterations $k$; and
2) the total number of (vector) communications per node $\mathcal{K}$. Note that, with both algorithms, the
number of outer iterations $k$ equals the number of gradient evaluations per node. Further, with
D-NG, $k = \mathcal{K}$, i.e., there is one and only one per-node communication in each iteration $k$. With
D-NC, however, there are multiple per-node communications in each iteration $k$. Throughout,
we refer to $\mathcal{K}$ as the number of communication rounds.
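
The two time scales of D-NC, and the difference between outer iterations $k$ and communication rounds $\mathcal{K}$, can be sketched for scalar variables as below. Everything here is an illustrative assumption rather than the authors' code: the ring topology, the Huber-type costs, the names `dnc`, `tau`, `consensus`, and `ring_weights`, the bound `mu_bound = 0.81` on $\mu(N)$, and in particular the inner-iteration rule in `tau`, which takes the form $\lceil (\log 3 + 3\log(k+1)) / (-\log \mu) \rceil$ as one reading of (12).

```python
import math


def huber_grad(x, a):
    """Gradient of the Huber loss huber(x - a) with unit threshold (L = 1)."""
    return max(-1.0, min(1.0, x - a))


def ring_weights(n):
    """Ring graph (every degree is 2) with W_ij = 1/(1 + 3*max(d_i, d_j)) = 1/7."""
    w = 1.0 / (1.0 + 3.0 * 2.0)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][(i - 1) % n] = w
        W[i][(i + 1) % n] = w
        W[i][i] = 1.0 - 2.0 * w
    return W


def consensus(W, z, num_iters):
    """Average-consensus recursion (9), run for num_iters rounds."""
    n = len(z)
    for _ in range(num_iters):
        z = [sum(W[i][j] * z[j] for j in range(n)) for i in range(n)]
    return z


def tau(k, mu_bound):
    """Assumed inner-iteration count: grows logarithmically with k."""
    return math.ceil(
        (math.log(3.0) + 3.0 * math.log(k + 1.0)) / (-math.log(mu_bound))
    )


def dnc(W, targets, alpha, mu_bound, outer_iters, x0=3.0):
    """Steps 2-8 of Algorithm 1; returns final x_i's and per-node communications."""
    n = len(targets)
    x = [x0] * n
    y = [x0] * n
    comms = 0
    for k in range(1, outer_iters + 1):
        t = tau(k - 1, mu_bound)
        beta = (k - 1) / (k + 2)                       # beta_{k-1}
        g = consensus(W, [huber_grad(y[i], targets[i]) for i in range(n)], t)
        x_new = consensus(W, [y[i] - alpha * g[i] for i in range(n)], t)   # (10)
        y = consensus(
            W, [x_new[i] + beta * (x_new[i] - x[i]) for i in range(n)], t  # (11)
        )
        x = x_new
        comms += 3 * t                                 # three consensus rounds
    return x, comms


x_dnc, comms_dnc = dnc(
    ring_weights(5), [-2.0, -1.0, 0.0, 1.0, 2.0],
    alpha=0.5, mu_bound=0.81, outer_iters=200,
)
```

The step-size `alpha=0.5` respects $\alpha \leq 1/(2L)$ for $L = 1$. The returned `comms_dnc` is the per-node communication count $\mathcal{K}$, which is much larger than the number of outer iterations (gradient evaluations) $k = 200$.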

IV. Intermediate results: Nesterov gradient under inexact oracle

Subsection IV-A introduces the framework of (centralized) Nesterov gradient under inexact
oracle and proves its relation for the progress in one iteration. Subsection IV-B shows how we
can cast our algorithms D-NG and D-NC in this framework.

A. Nesterov gradient under inexact oracle

Throughout this subsection, we consider a convex function $\phi : \mathbb{R}^d \to \mathbb{R}$ with Lipschitz
continuous gradient with constant $L_\phi$.

Definition 1 (Pointwise inexact first order oracle) Consider a function $\phi : \mathbb{R}^d \to \mathbb{R}$ that is convex
and has Lipschitz continuous gradient with constant $L_\phi$. We say that a pair $(\widehat{\phi}_y, \widehat{g}_y) \in \mathbb{R} \times \mathbb{R}^d$
is a $(L_y, \delta_y)$ inexact oracle of $\phi$ at point $y$ if:

\[ \phi(x) \geq \widehat{\phi}_y + \widehat{g}_y^\top (x - y), \quad \forall x \in \mathbb{R}^d \qquad (15) \]
\[ \phi(x) \leq \widehat{\phi}_y + \widehat{g}_y^\top (x - y) + \frac{L_y}{2} \|x - y\|^2 + \delta_y, \quad \forall x \in \mathbb{R}^d. \qquad (16) \]

Note that the pair $(\phi(y), \nabla\phi(y))$ satisfies Definition 1 with $(L_y = L_\phi, \delta_y = 0)$. Also, if $(\widehat{\phi}_y, \widehat{g}_y)$
is a $(L_y, \delta_y)$ inexact oracle at $y$, then it is also a $(L'_y, \delta_y)$ inexact oracle at $y$, with $L'_y \geq L_y$.

Remark. The prefix pointwise in Definition 1 emphasizes that the constants $(L_y, \delta_y)$ are
attached to a fixed point $y$. That is, Definition 1 is concerned with finding $(\widehat{\phi}_y, \widehat{g}_y)$ that satisfy
inequalities (15)-(16) with $(L_y, \delta_y)$ at a fixed point $y$. Note the difference from the conventional
definition (Definition 1) in [26]. Throughout the paper, we always refer to the inexact first order
oracle in the sense of our Definition 1, and hence we drop the prefix pointwise.

Nesterov gradient under inexact oracle. Lemma 2 gives the progress in one iteration of the
(centralized) Nesterov gradient under inexact oracle for the unconstrained minimization of $\phi$.
Consider a point $(x(k-1), y(k-1)) \in \mathbb{R}^d \times \mathbb{R}^d$, for some fixed $k = 1, 2, ...$ Let $(\widehat{\phi}_{k-1}, \widehat{g}_{k-1})$
be a $(L_{k-1}, \delta_{k-1})$ inexact oracle of the function $\phi$ at point $y(k-1)$ and:

\[ x(k) = y(k-1) - \frac{1}{L_{k-1}}\, \widehat{g}_{k-1} \qquad (17) \]
\[ y(k) = x(k) + \beta_{k-1} \left( x(k) - x(k-1) \right). \qquad (18) \]
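
Lemma 2 (Progress in one iteration under inexact oracle) Consider the update rule (17)-(18)
for some $k = 1, 2, ...$ Then, for any $x^\bullet \in \mathbb{R}^d$:

\[ (k+1)^2 \left( \phi(x(k)) - \phi(x^\bullet) \right) + 2 L_{k-1} \|v(k) - x^\bullet\|^2 \qquad (19) \]
\[ \leq (k^2 - 1) \left( \phi(x(k-1)) - \phi(x^\bullet) \right) + 2 L_{k-1} \|v(k-1) - x^\bullet\|^2 + (k+1)^2 \delta_{k-1}, \]

where $\theta_k = 2/(k+2)$ and

\[ v(k) = \frac{y(k) - (1 - \theta_k)\, x(k)}{\theta_k}. \qquad (20) \]

Proof: Step 1. We first prove the following two auxiliary equalities:

\[ v(k) = v(k-1) - \frac{1}{L_{k-1}\theta_{k-1}}\, \widehat{g}_{k-1} = v(k-1) + \frac{1}{\theta_{k-1}} \left( x(k) - y(k-1) \right) \qquad (21) \]
\[ \theta_{k-1}\, v(k) = x(k) - (1 - \theta_{k-1})\, x(k-1). \qquad (22) \]

We prove (21). Using $\theta_k = 2/(k+2)$, $\beta_{k-1} = (k-1)/(k+2)$, the definition of $v(k)$ in (20), and (18):

\[ v(k) = \frac{1}{\theta_k} \left( y(k) - (1 - \theta_k)\, x(k) \right) = \frac{k+2}{2} \left( x(k) + \frac{k-1}{k+2} \left( x(k) - x(k-1) \right) \right) - \frac{k}{2}\, x(k) \]
\[ = \frac{k+1}{2}\, x(k) - \frac{k-1}{2}\, x(k-1). \]

By rewriting the last equation, while using again the update rule for $x(k)$ in (17), and equalities
$\frac{k+1}{2} = \frac{1}{\theta_{k-1}}$ and $\frac{k-1}{2} = \frac{1 - \theta_{k-1}}{\theta_{k-1}}$:

\[ v(k) = \frac{1}{\theta_{k-1}} \left( y(k-1) - \frac{1}{L_{k-1}}\, \widehat{g}_{k-1} \right) - \frac{1 - \theta_{k-1}}{\theta_{k-1}}\, x(k-1) \]
\[ = \frac{y(k-1) - (1 - \theta_{k-1})\, x(k-1)}{\theta_{k-1}} - \frac{1}{L_{k-1}\theta_{k-1}}\, \widehat{g}_{k-1} = v(k-1) - \frac{1}{L_{k-1}\theta_{k-1}}\, \widehat{g}_{k-1}. \]

In the last equality, we used again the definition of $v(k-1)$ derived from (20).
We now prove (22). Multiplying (21) by $\theta_{k-1}$ and using (20):

\[ \theta_{k-1}\, v(k) = \theta_{k-1}\, v(k-1) + \left( x(k) - y(k-1) \right) \]
\[ = y(k-1) - (1 - \theta_{k-1})\, x(k-1) + \left( x(k) - y(k-1) \right) = x(k) - (1 - \theta_{k-1})\, x(k-1). \]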
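
Step 2. We prove the following relation:

\[ \phi(x(k)) \leq \phi(z) + L_{k-1} \left( x(k) - y(k-1) \right)^\top \left( z - x(k) \right) + \frac{L_{k-1}}{2} \|x(k) - y(k-1)\|^2 + \delta_{k-1}, \quad \forall z \in \mathbb{R}^d. \qquad (23) \]

Using the inexact oracle property (16):

\[ \phi(x(k)) \leq \widehat{\phi}_{k-1} + \widehat{g}_{k-1}^\top \left( x(k) - y(k-1) \right) + \frac{L_{k-1}}{2} \|x(k) - y(k-1)\|^2 + \delta_{k-1}. \qquad (24) \]

Further, $\forall z \in \mathbb{R}^d$: $0 = 0^\top (z - x(k)) = \left( x(k) - y(k-1) + \frac{1}{L_{k-1}}\, \widehat{g}_{k-1} \right)^\top (z - x(k))$, and so:

\[ \widehat{g}_{k-1}^\top \left( z - x(k) \right) + L_{k-1} \left( x(k) - y(k-1) \right)^\top \left( z - x(k) \right) = 0. \qquad (25) \]

From property (15): $\phi(z) \geq \widehat{\phi}_{k-1} + \widehat{g}_{k-1}^\top (z - y(k-1))$, and so, using the last equation and
adding (24) and (25), the claim (23) follows.

Step 3. We finalize the proof of Lemma 2 by proving (19). We start by using relation (23).
Namely: 1) setting $z = x(k-1)$ in (23) and multiplying inequality (23) by $1 - \theta_{k-1}$; 2) setting
$z = x^\bullet$ in (23) and multiplying inequality (23) by $\theta_{k-1}$; and 3) adding the corresponding two
inequalities:

\[ \theta_{k-1} \{ \phi(x(k)) - \phi(x^\bullet) \} + (1 - \theta_{k-1}) \{ \phi(x(k)) - \phi(x(k-1)) \} \]
\[ = \{ \phi(x(k)) - \phi(x^\bullet) \} - (1 - \theta_{k-1}) \{ \phi(x(k-1)) - \phi(x^\bullet) \} \]
\[ \leq \theta_{k-1} L_{k-1} \left( x(k) - y(k-1) \right)^\top \left( x^\bullet - x(k) \right) + (1 - \theta_{k-1}) L_{k-1} \left( x(k) - y(k-1) \right)^\top \left( x(k-1) - x(k) \right) \]
\[ \quad + \frac{L_{k-1}}{2} \|x(k) - y(k-1)\|^2 + \delta_{k-1} \]
\[ = L_{k-1} \left( x(k) - y(k-1) \right)^\top \left( \theta_{k-1} x^\bullet + (1 - \theta_{k-1})\, x(k-1) - x(k) \right) + \frac{L_{k-1}}{2} \|x(k) - y(k-1)\|^2 + \delta_{k-1} \]
\[ = \frac{L_{k-1}}{2} \left\{ 2 \left( x(k) - y(k-1) \right)^\top \left( \theta_{k-1} x^\bullet + (1 - \theta_{k-1})\, x(k-1) - x(k) \right) + \|x(k) - y(k-1)\|^2 \right\} + \delta_{k-1}. \qquad (26) \]

Denote by:

\[ \mathcal{M}_{k-1} = 2 \left( x(k) - y(k-1) \right)^\top \left( \theta_{k-1} x^\bullet + (1 - \theta_{k-1})\, x(k-1) - x(k) \right) + \|x(k) - y(k-1)\|^2. \]

Then, inequality (26) is written simply as:

\[ \{ \phi(x(k)) - \phi(x^\bullet) \} - (1 - \theta_{k-1}) \{ \phi(x(k-1)) - \phi(x^\bullet) \} \leq \frac{L_{k-1}}{2}\, \mathcal{M}_{k-1} + \delta_{k-1}. \qquad (27) \]

Now, we simplify the expression for $\mathcal{M}_{k-1}$ as follows. Using the identity:

\[ \|x(k) - y(k-1)\|^2 = 2 \left( x(k) - y(k-1) \right)^\top x(k) + \|y(k-1)\|^2 - \|x(k)\|^2, \]

we have:

\[ \mathcal{M}_{k-1} = 2 \left( x(k) - y(k-1) \right)^\top \left( \theta_{k-1} x^\bullet + (1 - \theta_{k-1})\, x(k-1) \right) - \|x(k)\|^2 + \|y(k-1)\|^2 \]
\[ = \left\| y(k-1) - \left( (1 - \theta_{k-1})\, x(k-1) + \theta_{k-1} x^\bullet \right) \right\|^2 - \left\| x(k) - \left( (1 - \theta_{k-1})\, x(k-1) + \theta_{k-1} x^\bullet \right) \right\|^2 \]
\[ = \theta_{k-1}^2 \|v(k-1) - x^\bullet\|^2 - \theta_{k-1}^2 \|v(k) - x^\bullet\|^2, \qquad (28) \]

where the last equality follows by the definition of $v(k-1)$ in (20) and by the identity (22).
Now, combining (27) and (28):

\[ \left( \phi(x(k)) - \phi(x^\bullet) \right) - (1 - \theta_{k-1}) \left( \phi(x(k-1)) - \phi(x^\bullet) \right) \]
\[ \leq \frac{L_{k-1}\theta_{k-1}^2}{2} \left( \|v(k-1) - x^\bullet\|^2 - \|v(k) - x^\bullet\|^2 \right) + \delta_{k-1}. \]

Finally, multiplying the last equation by $4/\theta_{k-1}^2$, and using $\theta_{k-1} = 2/(k+1)$, we get the result. ■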
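
The one-step inequality (19) can be checked numerically in the simplest setting: an exact oracle ($\delta_{k-1} = 0$, $L_{k-1} = L_\phi$) for a scalar function. The sketch below is only a sanity check under these assumptions. The test function $\phi(x) = \log(1 + e^x) - 0.3x$ (convex, with $L_\phi = 1/4$), the reference points, and the name `check_lemma2` are hypothetical choices, and the momentum is taken as $\beta_{k-1} = (k-1)/(k+2)$.

```python
import math

L_PHI = 0.25  # Lipschitz constant of the derivative of phi below


def phi(x):
    """Convex test function: softplus minus a linear term."""
    return math.log1p(math.exp(x)) - 0.3 * x


def dphi(x):
    """Derivative of phi: sigmoid(x) - 0.3."""
    return 1.0 / (1.0 + math.exp(-x)) - 0.3


def check_lemma2(x0, x_ref, iters=200):
    """Run (17)-(18) with the exact oracle and return the largest value of
    LHS - RHS of inequality (19); it should never be (noticeably) positive."""
    x_prev = x0
    y = x0
    v_prev = x0                   # v(0) = y(0), because theta_0 = 1
    worst = -math.inf
    for k in range(1, iters + 1):
        x = y - dphi(y) / L_PHI                       # update (17)
        beta = (k - 1) / (k + 2)                      # beta_{k-1}
        y_next = x + beta * (x - x_prev)              # update (18)
        theta = 2.0 / (k + 2)                         # theta_k
        v = (y_next - (1.0 - theta) * x) / theta      # definition (20)
        lhs = (k + 1) ** 2 * (phi(x) - phi(x_ref)) + 2.0 * L_PHI * (v - x_ref) ** 2
        rhs = (k * k - 1) * (phi(x_prev) - phi(x_ref)) + 2.0 * L_PHI * (v_prev - x_ref) ** 2
        worst = max(worst, lhs - rhs)
        x_prev, y, v_prev = x, y_next, v
    return worst


gap_at_minimizer = check_lemma2(4.0, math.log(0.3 / 0.7))   # x_ref = minimizer of phi
gap_at_other_point = check_lemma2(4.0, 1.7)                  # arbitrary x_ref
```

Since Lemma 2 holds for any $x^\bullet$, both reference points should give a non-positive worst-case difference, up to floating-point rounding.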

B. Algorithms D-NG and D-NC in the inexact oracle framework 

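We now cast algorithms D-NG and D-NC in the inexact oracle framework.

Algorithm D-NG. Denote by $\bar{y}(k) := \frac{1}{N}\sum_{i=1}^{N} y_i(k)$ and $\bar{x}(k) := \frac{1}{N}\sum_{i=1}^{N} x_i(k)$ the global
averages of the individual nodes' iterates $y_i(k)$'s and $x_i(k)$'s, respectively. Then, by multiplying
(7)-(8) from the left by $(1/N)(\mathbf{1}^\top \otimes I)$, and using $(\mathbf{1}^\top \otimes I)(W \otimes I) = \mathbf{1}^\top \otimes I$:

\[ \bar{x}(k) = \bar{y}(k-1) - \frac{\alpha_{k-1}}{N} \sum_{i=1}^{N} \nabla f_i(y_i(k-1)), \quad \bar{y}(k) = \bar{x}(k) + \beta_{k-1} \left( \bar{x}(k) - \bar{x}(k-1) \right), \qquad (29) \]

for $k = 1, 2, ...$, and $\bar{x}(0) = \bar{y}(0)$. Thus, with D-NG, $(\bar{x}(k), \bar{y}(k))$ evolves (almost) according to
the centralized Nesterov gradient method with step-size $\alpha_{k-1}/N$ to minimize $f := \sum_{i=1}^{N} f_i$, except
that the gradient $\nabla f(\bar{y}(k-1)) = \sum_{i=1}^{N} \nabla f_i(\bar{y}(k-1))$ is replaced by $\sum_{i=1}^{N} \nabla f_i(y_i(k-1))$.
Denote by:

\[ \widehat{f}_k = \sum_{i=1}^{N} \left\{ f_i(y_i(k)) + \nabla f_i(y_i(k))^\top \left( \bar{y}(k) - y_i(k) \right) \right\}, \quad \widehat{g}_k = \sum_{i=1}^{N} \nabla f_i(y_i(k)). \qquad (30) \]

Then, algorithm (29) is, defining $L'_{k-1} = \frac{N}{\alpha_{k-1}}$:

\[ \bar{x}(k) = \bar{y}(k-1) - \frac{1}{L'_{k-1}}\, \widehat{g}_{k-1}, \quad \bar{y}(k) = \bar{x}(k) + \beta_{k-1} \left( \bar{x}(k) - \bar{x}(k-1) \right), \qquad (31) \]

for $k = 1, 2, ...$ The following Lemma puts algorithm D-NG in the inexact oracle framework.

Lemma 3 Let Assumption 2 hold. Then, $(\widehat{f}_k, \widehat{g}_k)$ in (30) is a $(L_k, \delta_k)$ inexact oracle of $f =
\sum_{i=1}^{N} f_i$ at point $\bar{y}(k)$ with constants $L_k = 2NL$ and $\delta_k = L \sum_{i=1}^{N} \|\bar{y}(k) - y_i(k)\|^2$.
Remark. If $\alpha_{k-1} \leq \frac{1}{2L}$, then $L'_{k-1} = \frac{N}{\alpha_{k-1}} \geq L_{k-1} = 2NL$, and so the iteration
progress shown in Lemma 2 applies to $\bar{x}(k), \bar{y}(k)$ with D-NG, by setting $\phi = f$, $x(k) = \bar{x}(k)$,
$y(k) = \bar{y}(k)$, and $x^\bullet = x^\star$. Hence, we can establish the optimality gap $f(\bar{x}(k)) - f^\star$ by using
Lemma 2 and, additionally, by finding $\delta_k$, i.e., $\|\bar{y}(k) - y_i(k)\|$. This is detailed in Section V.

Proof of Lemma 3: For notation simplicity, we re-write $y(k)$ and $\bar{y}(k)$ as $y$ and $\bar{y}$, and
$\widehat{f}_k, \widehat{g}_k, L_k, \delta_k$ as $\widehat{f}_y, \widehat{g}_y, L_y, \delta_y$. In view of Definition 1, we need to show inequalities (15) and
(16). We first show (15). By convexity of $f_i(\cdot)$: $f_i(x) \geq f_i(y_i) + \nabla f_i(y_i)^\top (x - y_i)$, $\forall x$; summing
over $i = 1, ..., N$, using $f(x) = \sum_{i=1}^{N} f_i(x)$, and expressing $x - y_i = (x - \bar{y}) + (\bar{y} - y_i)$:

\[ f(x) \geq \sum_{i=1}^{N} \left( f_i(y_i) + \nabla f_i(y_i)^\top (\bar{y} - y_i) \right) + \left( \sum_{i=1}^{N} \nabla f_i(y_i) \right)^\top (x - \bar{y}) = \widehat{f}_y + \widehat{g}_y^\top (x - \bar{y}). \]

We now prove (16). As $f_i(\cdot)$ is convex and has Lipschitz continuous derivative with constant
$L$, we have: $f_i(x) \leq f_i(y_i) + \nabla f_i(y_i)^\top (x - y_i) + \frac{L}{2}\|x - y_i\|^2$, which, after summation over
$i = 1, ..., N$, expressing $x - y_i = (x - \bar{y}) + (\bar{y} - y_i)$, and using the inequality $\|x - y_i\|^2 =
\|(x - \bar{y}) + (\bar{y} - y_i)\|^2 \leq 2\|x - \bar{y}\|^2 + 2\|\bar{y} - y_i\|^2$, gives:

\[ f(x) \leq \sum_{i=1}^{N} \left( f_i(y_i) + \nabla f_i(y_i)^\top (\bar{y} - y_i) \right) + \left( \sum_{i=1}^{N} \nabla f_i(y_i) \right)^\top (x - \bar{y}) \]
\[ \quad + N L \|x - \bar{y}\|^2 + L \sum_{i=1}^{N} \|\bar{y} - y_i\|^2 = \widehat{f}_y + \widehat{g}_y^\top (x - \bar{y}) + \frac{2NL}{2} \|x - \bar{y}\|^2 + \delta_y. \; ■ \]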
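
Lemma 3 can be sanity-checked numerically for scalar variables. The sketch below builds the pair $(\widehat{f}_k, \widehat{g}_k)$ from (30) for Huber-type local costs and verifies inequalities (15)-(16) on a grid with $L_k = 2NL$ and $\delta_k = L\sum_i \|\bar{y} - y_i\|^2$. The local costs, the values in `targets` and `y`, and the names `inexact_oracle` and `oracle_holds` are illustrative assumptions.

```python
def huber(x, a):
    """Huber loss of x - a with unit threshold (convex, L = 1)."""
    r = abs(x - a)
    return 0.5 * r * r if r <= 1.0 else r - 0.5


def huber_grad(x, a):
    """Derivative of huber(x - a)."""
    return max(-1.0, min(1.0, x - a))


def inexact_oracle(y, targets, L=1.0):
    """Return (y_bar, f_hat, g_hat, L_k, delta_k) following (30) and Lemma 3."""
    n = len(y)
    y_bar = sum(y) / n
    f_hat = sum(
        huber(y[i], targets[i]) + huber_grad(y[i], targets[i]) * (y_bar - y[i])
        for i in range(n)
    )
    g_hat = sum(huber_grad(y[i], targets[i]) for i in range(n))
    L_k = 2.0 * n * L
    delta_k = L * sum((y_bar - yi) ** 2 for yi in y)
    return y_bar, f_hat, g_hat, L_k, delta_k


def oracle_holds(y, targets, grid, tol=1e-9):
    """Check (15) and (16) for f = sum_i f_i at every grid point x."""
    y_bar, f_hat, g_hat, L_k, delta_k = inexact_oracle(y, targets)
    for x in grid:
        f = sum(huber(x, a) for a in targets)
        lower = f_hat + g_hat * (x - y_bar)
        upper = lower + 0.5 * L_k * (x - y_bar) ** 2 + delta_k
        if not (lower - tol <= f <= upper + tol):
            return False
    return True


targets = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_nodes = [0.3, -0.4, 1.2, 0.9, -1.5]          # disagreeing local iterates
grid = [-5.0 + 0.05 * i for i in range(201)]
holds = oracle_holds(y_nodes, targets, grid)
```

When all nodes agree ($y_i = \bar{y}$), `delta_k` is zero and the pair reduces to the exact value and gradient of $f$ at $\bar{y}$; the disagreement enters only through $\delta_k$.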



Algorithm D-NC. Denote (the same as with D-NG, to avoid notational clutter) $\bar{x}(k) :=
\frac{1}{N}\sum_{i=1}^{N} x_i(k)$, and $\bar{y}(k) := \frac{1}{N}\sum_{i=1}^{N} y_i(k)$. Multiplying (13)-(14) from the left by $(1/N)\mathbf{1}^\top \otimes I$,
and using $(\mathbf{1}^\top \otimes I)(W \otimes I) = \mathbf{1}^\top \otimes I$, we get that $(\bar{x}(k), \bar{y}(k))$ satisfy (29), with $\alpha_{k-1}$ replaced
by the constant $\alpha$. Hence, with both D-NG and D-NC, the global averages $\bar{x}(k)$ and $\bar{y}(k)$ follow
the same rule (29), except for different step-sizes. What is also different, however, are the values
of disagreements $\|\bar{y}(k) - y_i(k)\|$. Define $\widehat{f}_k, \widehat{g}_k$ for D-NC as in (30), and consider (31) and
Lemma 3. Because $\alpha \leq 1/(2L)$, we have $L'_{k-1} = \frac{N}{\alpha} \geq 2NL$, and so Lemma 2 applies to
$(\bar{x}(k), \bar{y}(k))$ of D-NC by setting $\phi = f$, $x(k) = \bar{x}(k)$, $y(k) = \bar{y}(k)$, and $x^\bullet = x^\star$. Hence, as with
D-NG, convergence analysis of D-NC boils down to finding the disagreement $\|\bar{y}(k) - y_i(k)\|$
and then applying Lemma 2. This is considered in Section VI.



V. Algorithm D-NG: Convergence analysis

Subsection V-A bounds the disagreement $\|\bar{y}(k) - y_i(k)\|$ with D-NG. Subsection V-B combines
this bound with Lemma 2 to derive the convergence rate of D-NG; this subsection also studies
how the convergence rate depends on the underlying network.

A. Algorithm D-NG: Disagreement estimate

In this subsection, to avoid notational clutter, we let $d = 1$, but all the results hold for a
generic $d$. We consider algorithm D-NG and derive a generic bound on the differences between
the estimates $x_i(k)$ at different nodes. More precisely, we derive an upper bound on $\|\widetilde{x}(k)\|$,
where $\widetilde{x}(k) := x(k) - \bar{x}(k)\mathbf{1} = (I - J)x(k)$. Denote also by $\widetilde{y}(k) := y(k) - \bar{y}(k)\mathbf{1} = (I - J)y(k)$,
and recall $\widetilde{W} = W - J$, $\mu(N) = \|\widetilde{W}\|$.

Lemma 4 (Consensus estimate) Consider algorithm D-NG with step-size $\alpha_k = c/(k+1)$ under
Assumptions 1 and 3. Then, for $k = 1, 2, ...$:

\[ \|\widetilde{x}(k)\| \leq \sqrt{N}\, c\, G\, C_{\mathrm{cons}}\, \frac{1}{k} \quad \text{and} \quad \|\widetilde{y}(k)\| \leq 4\sqrt{N}\, c\, G\, C_{\mathrm{cons}}\, \frac{1}{k}, \]

where $C_{\mathrm{cons}}$ is a constant that depends on the matrix $W$ and equals:

\[ C_{\mathrm{cons}} = \ldots \qquad (32) \]

where $B : (0, 1) \to \mathbb{R}$, $B(r) = \sup_{z \geq 1/2} \left( z\, r^{z} \log(1 + z) \right)$.

We note that $B(r)$ is finite whenever $r \in (0, 1)$, which is the case here.
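
Proof: We outline the main steps in the proof. First, we model the dynamics of $(\widetilde{x}(k)^\top, \widetilde{x}(k-1)^\top)^\top$
as a linear time varying system with $(I - J)\nabla F(y(k))$ being the system inputs and
determine its solution. Second, we calculate the underlying time varying system matrices. Third,
we upper bound the norms of the time varying system matrices. Finally, in the fourth step, we
use these bounds and a summation argument to complete the proof of the Lemma.

1) Recursion for $(\widetilde{x}(k)^\top, \widetilde{x}(k-1)^\top)^\top$: Consider (7)-(8) with $d = 1$. (Here $W \otimes I = W$.)
Substituting the expression for $y(k-1)$ in (7), multiplying the resulting equation from the left
by $(I - J)$, and using $(I - J)W = \widetilde{W}(I - J)$, obtain:

\[ \widetilde{x}(k) = (1 + \beta_{k-2})\widetilde{W}\,\widetilde{x}(k-1) - \beta_{k-2}\widetilde{W}\,\widetilde{x}(k-2) - \alpha_{k-1}(I - J)\nabla F(y(k-1)), \quad k = 1, 2, ..., \]
\[ \widetilde{x}(0) = (I - J)x(0), \quad \widetilde{x}(-1) = 0, \]

where $\beta_k$, for $k = 0, 1, ...$, is in (5), and $\beta_{-1} = 0$. Recall that we assumed $x_i(0) = x_j(0)$ for all
$i, j$, so that $\widetilde{x}(0) = 0$. Next, the recursion for the $2N \times 1$ augmented state $(\widetilde{x}(k)^\top, \widetilde{x}(k-1)^\top)^\top$
becomes, for $k = 1, 2, ...$:

\[ \begin{bmatrix} \widetilde{x}(k) \\ \widetilde{x}(k-1) \end{bmatrix} = \begin{bmatrix} (1 + \beta_{k-2})\widetilde{W} & -\beta_{k-2}\widetilde{W} \\ I & 0 \end{bmatrix} \begin{bmatrix} \widetilde{x}(k-1) \\ \widetilde{x}(k-2) \end{bmatrix} - \alpha_{k-1} \begin{bmatrix} (I - J)\nabla F(y(k-1)) \\ 0 \end{bmatrix}, \qquad (33) \]

with $(\widetilde{x}(0)^\top, \widetilde{x}(-1)^\top)^\top = 0$. Define the $2N \times 2N$ system matrices:

\[ \Phi(k, t) := \prod_{s=2}^{k-t+1} \begin{bmatrix} (1 + \beta_{k-s})\widetilde{W} & -\beta_{k-s}\widetilde{W} \\ I & 0 \end{bmatrix}, \quad k > t, \qquad (34) \]

and $\Phi(k, k) = I$. Then, the solution to (33) is

\[ \begin{bmatrix} \widetilde{x}(k) \\ \widetilde{x}(k-1) \end{bmatrix} = \sum_{t=0}^{k-1} \Phi(k, t+1)\, \alpha_t \begin{bmatrix} -(I - J)\nabla F(y(t)) \\ 0 \end{bmatrix}, \quad k = 1, 2, ... \qquad (35) \]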
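2) Calculating $\Phi(k, t)$: We now show the interesting structure of the matrix $\Phi(k, t)$ in (34)
by decomposing it into the product of an orthonormal matrix $U$, a block-diagonal matrix, and
$U^\top$. While $U$ is independent of $k, t$, the block diagonal matrix depends on $k, t$, and has $2 \times 2$
diagonal blocks. Consider the matrix in (33) with $k - 2 = t$, for a generic $t = -1, 0, 1, ...$ Using
the eigenvalue decomposition $\widetilde{W} = Q \Lambda Q^\top$:

\[ \begin{bmatrix} (1 + \beta_t)\widetilde{W} & -\beta_t\widetilde{W} \\ I & 0 \end{bmatrix} = (Q \oplus Q) \begin{bmatrix} (1 + \beta_t)\Lambda & -\beta_t\Lambda \\ I & 0 \end{bmatrix} (Q \oplus Q)^\top = (Q \oplus Q)\, P \left( \oplus_{i=1}^{N} \Sigma_i(t) \right) P^\top (Q \oplus Q)^\top, \qquad (36) \]

where $P$ is the $2N \times 2N$ permutation matrix ($e_i$ here is the $i$-th column of the $2N \times 2N$ identity
matrix) $P = [e_1, e_{N+1}, e_2, e_{N+2}, ..., e_N, e_{2N}]$, and $\Sigma_i(t)$ is a $2 \times 2$ matrix:

\[ \Sigma_i(t) = \begin{bmatrix} (1 + \beta_t)\lambda_i(\widetilde{W}) & -\beta_t\lambda_i(\widetilde{W}) \\ 1 & 0 \end{bmatrix}. \qquad (37) \]

Using (36), and the fact that $(Q \oplus Q)P$ is orthonormal: $((Q \oplus Q)P) \cdot ((Q \oplus Q)P)^\top = (Q \oplus
Q)PP^\top(Q \oplus Q)^\top = (QQ^\top) \oplus (QQ^\top) = I$, we can express $\Phi(k, t)$ in (34) as:

\[ \Phi(k, t) = (Q \oplus Q)\, P \left( \oplus_{i=1}^{N} \prod_{s=2}^{k-t+1} \Sigma_i(k-s) \right) P^\top (Q \oplus Q)^\top, \;\; \text{for } k > t; \quad \Phi(k, k) = I. \qquad (38) \]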

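3) Bounding the norm of $\Phi(k, t)$: We next upper bound $\|\Phi(k, t)\|$. As $(Q \oplus Q)P$ is orthonormal,
$\Phi(k, t)$ has the same singular values as $\oplus_{i=1}^{N} \prod_{s=2}^{k-t+1} \Sigma_i(k-s)$, and so these two matrices also
share the same spectral norm (maximal singular value). Further, the matrix $\oplus_{i=1}^{N} \prod_{s=2}^{k-t+1} \Sigma_i(k-s)$
is block diagonal (with $2 \times 2$ blocks $\prod_{s=2}^{k-t+1} \Sigma_i(k-s)$), and so:

\[ \|\Phi(k, t)\| = \max_{i=1,...,N} \left\| \prod_{s=2}^{k-t+1} \Sigma_i(k-s) \right\|. \]

We proceed by calculating $\left\| \prod_{s=2}^{k-t+1} \Sigma_i(k-s) \right\|$. We distinguish two cases: $i = 1$, and $i > 1$.

Case $i = 1$. As $\lambda_1(\widetilde{W}) = 0$, we have that, for all $t$, $\Sigma_1(t) = \Sigma_1$ does not depend on $t$, with
$[\Sigma_1]_{21} = 1$, and the entries $(1,1)$, $(1,2)$ and $(2,2)$ of $\Sigma_1$ equal to zero. Note that $\|\Sigma_1\| = 1$, and
$(\Sigma_1)^s = 0$, $s \geq 2$. Thus, as long as $k > t + 1$, the product $\prod_{s=2}^{k-t+1} \Sigma_1(k-s) = 0$, and so:

\[ \left\| \prod_{s=2}^{k-t+1} \Sigma_1(k-s) \right\| = \begin{cases} 1 & \text{if } k = t+1 \\ 0 & \text{if } k > t+1. \end{cases} \qquad (39) \]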



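Case $i > 1$. To simplify notation, let $\lambda_i := \lambda_i(W)$, and recall $\lambda_i \in (0,1)$; $\Sigma_i(t)$ is:

$$\Sigma_i(t) = \widehat{\Sigma}_i - \frac{3}{t+3}\Delta_i, \qquad \widehat{\Sigma}_i = \begin{bmatrix}2\lambda_i & -\lambda_i\\ 1 & 0\end{bmatrix}, \qquad \Delta_i = \begin{bmatrix}\lambda_i & -\lambda_i\\ 0 & 0\end{bmatrix}.$$

The matrix $\widehat{\Sigma}_i$ is diagonalizable, with $\widehat{\Sigma}_i = \widehat{Q}_i\widehat{\mathcal{D}}_i\widehat{Q}_i^{-1}$, and

$$\widehat{Q}_i = \begin{bmatrix}\lambda_i + j\sqrt{\lambda_i(1-\lambda_i)} & \lambda_i - j\sqrt{\lambda_i(1-\lambda_i)}\\ 1 & 1\end{bmatrix}, \qquad \widehat{\mathcal{D}}_i = \begin{bmatrix}\lambda_i + j\sqrt{\lambda_i(1-\lambda_i)} & 0\\ 0 & \lambda_i - j\sqrt{\lambda_i(1-\lambda_i)}\end{bmatrix}.$$

Also,

$$\widehat{Q}_i^{-1} = \frac{1}{2j\sqrt{\lambda_i(1-\lambda_i)}}\begin{bmatrix}1 & -\lambda_i + j\sqrt{\lambda_i(1-\lambda_i)}\\ -1 & \lambda_i + j\sqrt{\lambda_i(1-\lambda_i)}\end{bmatrix}.$$

(Note that the matrices $\widehat{Q}_i$ and $\widehat{\mathcal{D}}_i$ are complex.) Denote by $\mathcal{D}_i(t) = \widehat{\mathcal{D}}_i - \frac{3}{t+3}\widehat{Q}_i^{-1}\Delta_i\widehat{Q}_i$. Then, $\Sigma_i(t) = \widehat{Q}_i\left(\widehat{\mathcal{D}}_i - \frac{3}{t+3}\widehat{Q}_i^{-1}\Delta_i\widehat{Q}_i\right)\widehat{Q}_i^{-1} = \widehat{Q}_i\mathcal{D}_i(t)\widehat{Q}_i^{-1}$, and so, by the sub-multiplicative property of norms:

$$\left\|\prod_{s=2}^{k-t+1}\Sigma_i(k-s)\right\| \le \|\widehat{Q}_i\|\,\|\widehat{Q}_i^{-1}\|\prod_{s=2}^{k-t+1}\|\mathcal{D}_i(k-s)\|. \tag{40}$$

Further, the norms of $\widehat{Q}_i$ and $\widehat{Q}_i^{-1}$ are upper bounded as:

$$\|\widehat{Q}_i\| \le \sqrt{2}\,\|\widehat{Q}_i\|_\infty = 2\sqrt{2}, \qquad \|\widehat{Q}_i^{-1}\| \le \sqrt{2}\,\|\widehat{Q}_i^{-1}\|_\infty \le \frac{2\sqrt{2}}{\sqrt{\lambda_i(1-\lambda_i)}}. \tag{41}$$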

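It remains to upper bound $\|\mathcal{D}_i(t)\|$, for all $t = -1, 0, 1, \ldots$ We will show that

$$\|\mathcal{D}_i(t)\| \le \sqrt{\lambda_i}, \quad \forall t = -1, 0, 1, \ldots \tag{42}$$

Denote by $a_t = \frac{3}{t+3}$, $t = 0, 1, \ldots$, and $a_{-1} = 1$. After some algebra:

$$\mathcal{D}_i(t) = \frac{1}{2}\begin{bmatrix}(2-a_t)\left(\lambda_i + j\sqrt{\lambda_i(1-\lambda_i)}\right) & a_t\left(\lambda_i - j\sqrt{\lambda_i(1-\lambda_i)}\right)\\ a_t\left(\lambda_i + j\sqrt{\lambda_i(1-\lambda_i)}\right) & (2-a_t)\left(\lambda_i - j\sqrt{\lambda_i(1-\lambda_i)}\right)\end{bmatrix},$$

and:

$$\mathcal{D}_i^H(t)\mathcal{D}_i(t) = \begin{bmatrix}\frac{a_t^2 + (2-a_t)^2}{4}\lambda_i & \frac{a_t(2-a_t)}{2}\left(2\lambda_i^2 - \lambda_i - 2j\lambda_i\sqrt{\lambda_i(1-\lambda_i)}\right)\\ \frac{a_t(2-a_t)}{2}\left(2\lambda_i^2 - \lambda_i + 2j\lambda_i\sqrt{\lambda_i(1-\lambda_i)}\right) & \frac{a_t^2 + (2-a_t)^2}{4}\lambda_i\end{bmatrix}.$$

Next, very interestingly: $\|\mathcal{D}_i^H(t)\mathcal{D}_i(t)\|_1 = \left|[\mathcal{D}_i^H(t)\mathcal{D}_i(t)]_{11}\right| + \left|[\mathcal{D}_i^H(t)\mathcal{D}_i(t)]_{21}\right| = \frac{1}{4}\left(a_t^2 + (2-a_t)^2\right)\lambda_i + \frac{1}{2}a_t(2-a_t)\lambda_i = \lambda_i$, for any $a_t \in [0,2]$, which is the case here because $a_t = 3/(t+3)$, $t = 0, 1, \ldots$, and $a_{-1} = 1$. Thus, as $\|A\| \le \|A\|_1$ for a Hermitian matrix $A$: $\|\mathcal{D}_i(t)\| = \sqrt{\|\mathcal{D}_i^H(t)\mathcal{D}_i(t)\|} \le \sqrt{\|\mathcal{D}_i^H(t)\mathcal{D}_i(t)\|_1} = \sqrt{\lambda_i}$. Applying the last equation, (41) and (42) to (40), we get, for $i \ne 1$:

$$\left\|\prod_{s=2}^{k-t+1}\Sigma_i(k-s)\right\| \le \|\widehat{Q}_i\|\,\|\widehat{Q}_i^{-1}\|\left(\sqrt{\lambda_i}\right)^{k-t} \le \frac{8}{\sqrt{\lambda_i(1-\lambda_i)}}\left(\sqrt{\lambda_i}\right)^{k-t}, \quad k > t. \tag{43}$$

Finally, using $\|\Phi(k,t)\| = \max_{i=1,\ldots,N}\left\|\prod_{s=2}^{k-t+1}\Sigma_i(k-s)\right\|$, and (39), (43):

$$\|\Phi(k,t)\| \le \frac{8}{\min_{i\in\{2,N\}}\sqrt{\lambda_i(W)(1-\lambda_i(W))}}\left(\sqrt{\mu(N)}\right)^{k-t}, \quad k \ge t. \tag{44}$$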
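
The key step behind (43) and (44) is the bound $\|\mathcal{D}_i(t)\| \le \sqrt{\lambda_i}$. The following is a small numerical sanity check of that bound, not part of the proof. It assumes the closed form of $\mathcal{D}_i(t)$ with $a_t = 3/(t+3)$ and $a_{-1} = 1$; the function names `d_matrix`, `spectral_norm_2x2` and `worst_ratio` are illustrative only.

```python
import math


def d_matrix(lam, t):
    """2x2 complex matrix D_i(t) for an eigenvalue lam in (0, 1) and t >= -1."""
    a = 1.0 if t == -1 else 3.0 / (t + 3)  # a_t = 3/(t+3), with a_{-1} = 1 (assumed)
    root = math.sqrt(lam * (1.0 - lam))
    mu_plus = complex(lam, root)    # lam + j*sqrt(lam*(1-lam))
    mu_minus = complex(lam, -root)  # lam - j*sqrt(lam*(1-lam))
    return [[0.5 * (2.0 - a) * mu_plus, 0.5 * a * mu_minus],
            [0.5 * a * mu_plus, 0.5 * (2.0 - a) * mu_minus]]


def spectral_norm_2x2(m):
    """Largest singular value of a 2x2 complex matrix, from the eigenvalues of M^H M."""
    (p, q), (r, s) = m
    g11 = abs(p) ** 2 + abs(r) ** 2          # entries of the Hermitian matrix M^H M
    g22 = abs(q) ** 2 + abs(s) ** 2
    g12 = p.conjugate() * q + r.conjugate() * s
    trace = g11 + g22
    det = g11 * g22 - abs(g12) ** 2
    disc = max(trace * trace - 4.0 * det, 0.0)
    return math.sqrt(0.5 * (trace + math.sqrt(disc)))


def worst_ratio(lams, ts):
    """Largest value of ||D_i(t)|| / sqrt(lam) over the given grids."""
    return max(spectral_norm_2x2(d_matrix(lam, t)) / math.sqrt(lam)
               for lam in lams for t in ts)
```

For instance, `worst_ratio([i / 100 for i in range(1, 100)], range(-1, 200))` should not exceed 1 up to rounding error, in agreement with (42).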



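4) Summation: We apply (44) to (35). Using the sub-multiplicative and sub-additive properties of norms, the expression $\alpha_t = c/(t+1)$, and the inequalities $\|\tilde{x}(k)\| \le \|(\tilde{x}(k)^\top, \tilde{x}(k-1)^\top)^\top\|$ and $\|(-(I-J)\nabla F(y(t))^\top, 0^\top)^\top\| \le \sqrt{N}G$:

$$\|\tilde{x}(k)\| \le \frac{8\sqrt{N}cG}{\min_{i\in\{2,N\}}\sqrt{\lambda_i(W)(1-\lambda_i(W))}}\sum_{t=0}^{k-1}\left(\sqrt{\mu(N)}\right)^{k-(t+1)}\frac{1}{t+1}. \tag{45}$$

We now denote by $r := \sqrt{\mu(N)} \in (0,1)$. To complete the proof of the Lemma, we upper bound the sum $\sum_{t=0}^{k-1} r^{k-(t+1)}\frac{1}{t+1}$ by splitting it into two sums. With the first sum, $t+1$ runs from $1$ to $\lfloor k/2\rfloor$, while with the second sum, $t+1$ runs from $\lfloor k/2\rfloor + 1$ to $k$:

$$\sum_{t=0}^{k-1}\frac{r^{k-(t+1)}}{t+1} = \left(r^{k-1} + \frac{r^{k-2}}{2} + \cdots + \frac{r^{k-\lfloor k/2\rfloor}}{\lfloor k/2\rfloor}\right) + \left(\frac{r^{k-(\lfloor k/2\rfloor+1)}}{\lfloor k/2\rfloor+1} + \cdots + \frac{1}{k}\right)$$
$$\le r^{k/2}\left(1 + \frac{1}{2} + \cdots + \frac{1}{\lfloor k/2\rfloor}\right) + \frac{1}{k/2}\left(1 + r + \cdots + r^{k}\right).$$

Using the following inequality for the harmonic number $h_t = 1 + \frac{1}{2} + \cdots + \frac{1}{t}$, $t = 1, 2, \ldots$: $h_t \le \log t + C_h + \frac{1}{2t}$, $t = 1, 2, \ldots$, we have:

$$\sum_{t=0}^{k-1}\frac{r^{k-(t+1)}}{t+1} \le r^{k/2}\left(\log(1+k/2) + C_h + 1/2\right) + \frac{2}{k}\,\frac{1}{1-r}$$
$$= 2\left\{r^{k/2}\log(1+k/2)(k/2)\right\}\frac{1}{k} + \left\{2(C_h+1/2)\,r^{k/2}(k/2)\right\}\frac{1}{k} + \frac{2}{1-r}\,\frac{1}{k}$$
$$\le 2\sup_{z\ge 1/2}\left\{r^{z}\log(1+z)z\right\}\frac{1}{k} + 2(C_h+1/2)\sup_{z\ge 1/2}\left\{r^{z}z\right\}\frac{1}{k} + \frac{2}{1-r}\,\frac{1}{k}$$
$$\le \left(2B(r) + \frac{2(C_h+1/2)}{e(-\log r)} + \frac{2}{1-r}\right)\frac{1}{k}.$$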
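
The first inequality in the chain above can be checked numerically. The sketch below compares the weighted sum with the intermediate bound $r^{k/2}(\log(1+k/2) + C_h + 1/2) + \frac{2}{k(1-r)}$. It assumes that $C_h$ is the Euler-Mascheroni constant (the text does not state its value), and the function names are illustrative only.

```python
import math

# Assumption: C_h is taken to be the Euler-Mascheroni constant, for which
# h_t <= log t + C_h + 1/(2t) holds for all t >= 1.
C_H = 0.5772156649015329


def weighted_sum(r, k):
    """sum_{t=0}^{k-1} r^(k-(t+1)) / (t+1), the sum appearing in (45)."""
    return sum(r ** (k - (t + 1)) / (t + 1) for t in range(k))


def intermediate_bound(r, k):
    """r^(k/2) * (log(1 + k/2) + C_h + 1/2) + 2 / (k * (1 - r))."""
    return r ** (k / 2) * (math.log(1 + k / 2) + C_H + 0.5) + 2.0 / (k * (1.0 - r))


def bound_holds(r, k_max):
    """True if the bound holds for every k = 1, ..., k_max."""
    return all(weighted_sum(r, k) <= intermediate_bound(r, k) + 1e-12
               for k in range(1, k_max + 1))
```

Calling `bound_holds(r, 200)` for a few values of `r` in (0, 1) gives a quick check that the sum indeed decays like $1/k$ up to the constants above.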

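Finally, applying the above to (45), and using the $C_{\mathrm{cons}}$ in (32), the claim of Lemma 4 on $\|\tilde{x}(k)\|$ follows. Then, as $\tilde{y}(k) = \tilde{x}(k) + \frac{k-1}{k+2}\left(\tilde{x}(k) - \tilde{x}(k-1)\right)$, we have that $\|\tilde{y}(k)\| \le 2\|\tilde{x}(k)\| + \|\tilde{x}(k-1)\|$. Further, by the claim of Lemma 4 for $\tilde{x}(k)$: $\|\tilde{x}(k-1)\| \le c\sqrt{N}GC_{\mathrm{cons}}\frac{1}{k-1} \le 2c\sqrt{N}GC_{\mathrm{cons}}\frac{1}{k}$, $k \ge 2$, and $\|\tilde{x}(0)\| = 0$ (by assumption). Thus, $\|\tilde{x}(k-1)\| \le 2c\sqrt{N}GC_{\mathrm{cons}}\frac{1}{k}$, $\forall k \ge 1$. Thus, $\|\tilde{y}(k)\| \le 2\|\tilde{x}(k)\| + \|\tilde{x}(k-1)\| \le 4c\sqrt{N}GC_{\mathrm{cons}}\frac{1}{k}$, $\forall k \ge 1$. ∎

B. Convergence rate and network scaling

We now state the convergence rate result for algorithm D-NG.

Theorem 5 Consider algorithm D-NG under Assumptions 1-3, with step-size $\alpha_k = \frac{c}{k+1}$, $c \le 1/(2L)$. Let $\|\bar{x}(0) - x^\star\| \le R$, $R \ge 0$. Then, for all $i$, for all $k = 1, 2, \ldots$:

$$\frac{1}{N}\left(f(x_i(k)) - f^\star\right) \le \frac{2}{c}R^2\,\frac{1}{k} + 16c^2LC_{\mathrm{cons}}^2G^2\,\frac{1}{k}\sum_{t=1}^{k-1}\frac{(t+2)^2}{(t+1)t^2} + cG^2C_{\mathrm{cons}}\,\frac{1}{k}. \tag{46}$$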

Remark. Theorem 5 extends to the case $c > 1/(2L)$; in that case, $\frac{1}{N}\left(f(x_i(k)) - f^\star\right) \le C'(N)\left(\frac{1}{k}\sum_{t=1}^{k}\frac{(t+2)^2}{(t+1)t^2}\right)$, where $C'(N) = C''(L,G,R,c) + 16c^2LC_{\mathrm{cons}}^2G^2 + cG^2C_{\mathrm{cons}}$, and $C''(L,G,R,c) \in [0,\infty)$ is a constant that depends on $L$, $G$, $R$, $c$, and is independent of $N$, $\mu(N)$. The proof of this extension is in the Appendix.

Remark. Note that distributed Nesterov gradient uses the diminishing step size $\alpha_k = c/(k+1)$. Even the centralized Nesterov gradient with $\alpha_k = c/(k+1)$ achieves the rate $O(1/k)$ (and not $O(1/k^2)$). The reason why we use a diminishing step size with distributed Nesterov gradient, and not the constant one (which would give $O(1/k^2)$ in the centralized setting), is that, with a constant step size, the disagreement estimate $\|\bar{y}(k) - y_i(k)\|$ would not converge to zero. Thus, the oracle inexactness $\delta_k = L\sum_{i=1}^{N}\|\bar{y}(k) - y_i(k)\|^2$ in Lemma 2 would not converge to zero, and the effect of the $\delta_k$'s would accumulate too much over iterations $k$, destroying the convergence rate. In fact, the algorithms distributed Nesterov gradient (D-NG) and distributed Nesterov gradient with consensus iterations (D-NC) offer two different ways to control the $\delta_k$'s. D-NG does this through a diminishing step-size $\alpha_k$, while D-NC uses the inner consensus iterations (while using the constant step size $\alpha_k = \alpha$).
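
To make the role of the diminishing step size concrete, here is a minimal sketch of a D-NG-style iteration for scalar Huber losses on a small ring network. It assumes the update $x(k) = Wy(k-1) - \alpha_{k-1}\nabla F(y(k-1))$, $y(k) = x(k) + \beta_{k-1}(x(k) - x(k-1))$ with $\alpha_k = c/(k+1)$ and $\beta_k = k/(k+3)$. The ring weights, the targets `thetas`, and all function names are hypothetical choices made for illustration, not the setup used in the simulations.

```python
def huber(x, theta):
    """Scalar Huber loss centered at theta."""
    d = abs(x - theta)
    return 0.5 * d * d if d <= 1.0 else d - 0.5


def huber_grad(x, theta):
    """Derivative of the Huber loss: the residual clipped to [-1, 1]."""
    return max(-1.0, min(1.0, x - theta))


def ring_weights(n):
    """Symmetric, stochastic weights on a ring: 1/2 on the diagonal, 1/4 per neighbor."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        w[i][i] = 0.5
        w[i][(i + 1) % n] += 0.25
        w[i][(i - 1) % n] += 0.25
    return w


def total_cost(x, thetas):
    """f(x) = sum_i f_i(x) for a common scalar x."""
    return sum(huber(x, th) for th in thetas)


def dng(w, thetas, c, iters):
    """Run `iters` D-NG-style iterations from x(0) = y(0) = 0; return the node estimates."""
    n = len(thetas)
    x_prev = [0.0] * n
    y = [0.0] * n
    for k in range(1, iters + 1):
        alpha = c / k              # alpha_{k-1} = c / ((k-1) + 1)
        beta = (k - 1) / (k + 2)   # beta_{k-1}
        # One communication round (weighted average of neighbors' y) plus a local gradient step.
        x = [sum(w[i][j] * y[j] for j in range(n)) - alpha * huber_grad(y[i], thetas[i])
             for i in range(n)]
        y = [x[i] + beta * (x[i] - x_prev[i]) for i in range(n)]
        x_prev = x
    return x_prev
```

With `thetas = [0.0, 1.0, 2.0, 3.0, 10.0]` the sum of Huber losses is minimized at $x = 2$, so `dng(ring_weights(5), thetas, 0.5, 2000)` can be compared against that point; the spread between node estimates shrinks as the step size decays.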

Proof of Theorem 5: The proof consists of two parts. First, we estimate the optimality gap $\frac{1}{N}\left(f(\bar{x}(k)) - f^\star\right)$ at the point $\bar{x}(k) = \frac{1}{N}\sum_{i=1}^{N}x_i(k)$ using Lemma 2 and the inexact oracle machinery. Second, we estimate the optimality gap $\frac{1}{N}\left(f(x_i(k)) - f^\star\right)$ at any node $i$ using convexity of the $f_i$'s and the bound on the distances $\|x_i(k) - \bar{x}(k)\|$ from Lemma 4.

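Step 1. Optimality gap $(f(\bar{x}(k)) - f^\star)$. Recall that, for $k = 1, 2, \ldots$, $(\hat{f}_k, \hat{g}_k)$ in (30) is a $(L_k, \delta_k)$ inexact oracle of $f$ at point $\bar{y}(k)$ with $L_k = 2NL$ and $\delta_k = L\|\tilde{y}(k)\|^2$. Note that $(\hat{f}_k, \hat{g}_k)$ is also a $(L_k', \delta_k)$ inexact oracle of $f$ at point $\bar{y}(k)$ with $L_k' = \frac{N}{c}(k+1) = \frac{N}{\alpha_k}$, because $\frac{1}{c} \ge 2L$, and so $L_k' \ge L_k$. Now, we apply Lemma 2 to (31), where we set $\phi \equiv f$, $x(k) \equiv \bar{x}(k)$, $y(k) \equiv \bar{y}(k)$, $x^\star \equiv x^\star$, and the Lipschitz constant $L_k' = 1/(\alpha_k/N) \ge 2NL(k+1)$. (Note also that $\delta_k = L\sum_{i=1}^{N}\|\bar{y}(k) - y_i(k)\|^2 = L\|\tilde{y}(k)\|^2$.) We get:

$$\frac{(k+1)^2}{k}\left(f(\bar{x}(k)) - f^\star\right) + \frac{2N}{c}\|\bar{v}(k) - x^\star\|^2 \le \frac{k^2-1}{k}\left(f(\bar{x}(k-1)) - f^\star\right) + \frac{2N}{c}\|\bar{v}(k-1) - x^\star\|^2 + L\|\tilde{y}(k-1)\|^2\frac{(k+1)^2}{k}.$$

Because $\frac{(k+1)^2}{k} \ge \frac{(k+1)^2-1}{k}$, and $(f(\bar{x}(k)) - f^\star) \ge 0$, we have:

$$\frac{(k+1)^2-1}{k}\left(f(\bar{x}(k)) - f^\star\right) + \frac{2N}{c}\|\bar{v}(k) - x^\star\|^2 \tag{47}$$
$$\le \frac{k^2-1}{k}\left(f(\bar{x}(k-1)) - f^\star\right) + \frac{2N}{c}\|\bar{v}(k-1) - x^\star\|^2 + L\|\tilde{y}(k-1)\|^2\frac{(k+1)^2}{k}, \tag{48}$$

which, by unwinding the above recursion, and using $\bar{v}(0) = \bar{x}(0)$, gives: $\frac{(k+1)^2-1}{k}\left(f(\bar{x}(k)) - f^\star\right) \le \frac{2N}{c}\|\bar{x}(0) - x^\star\|^2 + L\sum_{t=1}^{k}\|\tilde{y}(t-1)\|^2\frac{(t+1)^2}{t}$. Applying Lemma 4 to the last equation, and using $\frac{k}{(k+1)^2-1} = \frac{1}{k+2} \le \frac{1}{k}$, $\|\tilde{y}(0)\| = 0$ (by assumption), gives:

$$\frac{1}{N}\left(f(\bar{x}(k)) - f^\star\right) \le \frac{2}{c}R^2\,\frac{1}{k} + 16c^2LC_{\mathrm{cons}}^2G^2\,\frac{1}{k}\sum_{t=2}^{k}\frac{(t+1)^2}{t(t-1)^2}. \tag{49}$$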

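Step 2. Optimality gap $(f(x_i(k)) - f^\star)$. Fix an arbitrary node $i$; then, by convexity of $f_j$, $j = 1, 2, \ldots, N$: $f_j(\bar{x}(k)) \ge f_j(x_i(k)) + \nabla f_j(x_i(k))^\top(\bar{x}(k) - x_i(k))$, and so: $f_j(x_i(k)) \le f_j(\bar{x}(k)) + G\|\bar{x}(k) - x_i(k)\|$. Summing the above inequalities for $j = 1, \ldots, N$, using $\sum_{j=1}^{N}\|\bar{x}(k) - x_j(k)\| = \sum_{j=1}^{N}\|\tilde{x}_j(k)\| \le \sqrt{N}\|\tilde{x}(k)\|$, subtracting $f^\star$ from both sides, and using $\|\tilde{x}(k)\| \le \sqrt{N}cGC_{\mathrm{cons}}(1/k)$ from Lemma 4:

$$f(x_i(k)) - f^\star \le f(\bar{x}(k)) - f^\star + G\sqrt{N}\|\tilde{x}(k)\| \le f(\bar{x}(k)) - f^\star + cNC_{\mathrm{cons}}G^2\,\frac{1}{k}, \tag{50}$$

which, with (49) (in which the summation variable $t$ is replaced by $t+1$) completes the proof. ∎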



Network Scaling. We examine how the convergence rate depends on $N$ and the network topology. We assume that $L$, $G$, and $R$ do not depend on $N$. To formally set up the network scaling, suppose that we have a given sequence of the weight matrices $W^{(1)}, W^{(2)}, \ldots, W^{(N)}, \ldots$, where the size of $W^{(N)}$ is $N \times N$. (The matrices increase in size $N$ to emulate the increase of the network.) Then, we consider how $C(N)$ in Theorem 5 depends on the matrix $W = W^{(N)}$. Recall that $\mu(N)$ equals the second largest eigenvalue of $W^{(N)}$. For simplicity, we write $W$ instead of $W^{(N)}$; likewise, we write $\widetilde{W} = W^{(N)} - J$. With many network models, $\mu(N) \to 1$ as $N \to \infty$, i.e., the network speed of consensus deteriorates with the increase of $N$. The expander graphs are exceptions, for which the weight matrix $W$ can be chosen such that $1 - \mu(N)$ stays bounded away from zero. We distinguish two cases: 1) nodes do not know $L$ and $\mu(N)$ before the algorithm run, and they set the step-size constant $c$ to a constant independent of $N$, e.g., $c = c(N) = 1$; and 2) nodes know $L$, $\mu(N)$, and they set $c = c(N) = \Theta\left(1 - \mu(N)\right)$.

Theorem 6 Consider algorithm D-NG under Assumptions 1-3, with step-size $\alpha_k = c/(k+1)$.
Also, suppose that \2iyV) = as N ^ 00} 
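(a) For arbitrary $c = c(N) = \mathrm{const} > 0$:

$$\frac{1}{N}\left(f(x_i(k)) - f^\star\right) = O\left(\frac{1}{(1-\mu(N))^{3+\xi}}\,\frac{\log k}{k}\right).$$

(b) For $c = c(N) = \Theta\left(1 - \mu(N)\right)$:

$$\frac{1}{N}\left(f(x_i(k)) - f^\star\right) = O\left(\frac{1}{(1-\mu(N))^{1+\xi}}\,\frac{\log k}{k}\right). \tag{51}$$

¹This is true, e.g., with the weights example based on the neighbors' degrees in Section II.

Proof of Theorem 6 is in the Appendix. We can derive explicitly how $1/(1-\mu(N))$ depends on $N$ for commonly used models, like grids, geometric graphs, and expanders. The resulting dependence of $1/(1-\mu(N))$ on $N$ can be found in [14].

VI. Algorithm D-NC: Convergence Analysis

Subsection VI-A provides the disagreement estimate, while Subsection VI-B gives the convergence rate and network scaling.

A. Disagreement estimate

Consensus Estimate. For notational simplicity, throughout this subsection, we set $d = 1$, but all results hold for generic $d$. We now estimate the disagreement $\tilde{x}(k) := x(k) - \bar{x}(k)1 = (I-J)x(k)$, and $\tilde{y}(k) := y(k) - \bar{y}(k)1 = (I-J)y(k)$ with the D-NC.

Lemma 7 Let Assumptions 1-3 hold, and consider algorithm D-NC. Further, set the number of the inner consensus iterations $\tau_k$ at the outer iteration $k$ as in (12). Then, for $k = 1, 2, \ldots$:

$$\|\tilde{x}(k)\| \le 2\alpha\sqrt{N}G\,\frac{1}{k^2}, \qquad \|\tilde{y}(k)\| \le 2\alpha\sqrt{N}G\,\frac{1}{k^2}. \tag{52}$$

Proof: Denote by $B_{t-1} := \max\{\|\tilde{x}(t-1)\|, \|\tilde{y}(t-1)\|\}$, and fix $t \ge 1$. We want to upper bound $B_t$. Multiplying (13)-(14) by $(I-J)$ from the left, using $(I-J)W = \widetilde{W}(I-J)$:

$$\tilde{x}(t) = \widetilde{W}^{\tau_{t-1}}\tilde{y}(t-1) - \alpha\widetilde{W}^{\tau_{t-1}}(I-J)\nabla F(y(t-1)) \tag{53}$$
$$\tilde{y}(t) = \widetilde{W}^{\tau_{t-1}}\left[\tilde{x}(t) + \beta_{t-1}\left(\tilde{x}(t) - \tilde{x}(t-1)\right)\right]. \tag{54}$$

From (53) and (54), using the sub-additive and sub-multiplicative properties of norms, $\|\widetilde{W}\| = \mu(N) \in (0,1)$, $\|(I-J)\nabla F(y(t-1))\| \le \|\nabla F(y(t-1))\| \le \sqrt{N}G$, and $\beta_{t-1} \le 1$:

$$\|\tilde{x}(t)\| \le (\mu(N))^{\tau_{t-1}}\|\tilde{y}(t-1)\| + \alpha(\mu(N))^{\tau_{t-1}}\sqrt{N}G \le (\mu(N))^{\tau_{t-1}}B_{t-1} + \alpha(\mu(N))^{\tau_{t-1}}\sqrt{N}G \tag{55}$$
$$\|\tilde{y}(t)\| \le 2(\mu(N))^{\tau_{t-1}}\|\tilde{x}(t)\| + (\mu(N))^{\tau_{t-1}}\|\tilde{x}(t-1)\| \le 2(\mu(N))^{2\tau_{t-1}}B_{t-1} + 2\alpha\sqrt{N}(\mu(N))^{2\tau_{t-1}}G + (\mu(N))^{\tau_{t-1}}B_{t-1}. \tag{56}$$

Using $(\mu(N))^{2\tau_{t-1}} \le (\mu(N))^{\tau_{t-1}}$ to further upper bound (55) and (56), and taking the maximum over the two resulting inequalities:

$$B_t \le 3(\mu(N))^{\tau_{t-1}}B_{t-1} + 2\alpha\sqrt{N}(\mu(N))^{\tau_{t-1}}G. \tag{57}$$

Now, recall $\tau_{t-1}$ in (12). We have that $3(\mu(N))^{\tau_{t-1}} \le 3e^{-\log 3}e^{-3\log t} = \frac{1}{t^3}$. Applying the latter to (57), and using $(\mu(N))^{\tau_{t-1}} \le (\mu(N))^{\frac{3\log t}{-\log\mu(N)}} = e^{-3\log t} = \frac{1}{t^3}$, we get $B_t \le \frac{1}{t^3}B_{t-1} + \frac{1}{t^3}2\alpha\sqrt{N}G$. Next, using $B_0 = 0$, and unwinding the latter recursion for $t = k, k-1, \ldots, 1$, we get, for $k \ge 1$:

$$B_k \le 2\alpha\sqrt{N}G\left(\frac{1}{k^3} + \frac{1}{k^3(k-1)^3} + \cdots + \frac{1}{k^3(k-1)^3\cdots 1^3}\right) \le 2\alpha\sqrt{N}G\,\frac{k}{k^3} = 2\alpha\sqrt{N}G\,\frac{1}{k^2}.$$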
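
The unwinding step can be sanity-checked numerically. The sketch below assumes the recursion reads $B_t \le \frac{1}{t^3}B_{t-1} + \frac{1}{t^3}\cdot 2\alpha\sqrt{N}G$ with $B_0 = 0$ (the exponents are a reading of the damaged source), iterates it with equality, and compares the result with $2\alpha\sqrt{N}G/k^2$. The name `unwind` and the argument `scale`, which stands for $2\alpha\sqrt{N}G$, are illustrative only.

```python
def unwind(scale, k):
    """Iterate B_t = (B_{t-1} + scale) / t**3 from B_0 = 0 and return B_k."""
    b = 0.0
    for t in range(1, k + 1):
        b = (b + scale) / t ** 3
    return b


def claimed_bound(scale, k):
    """The bound scale / k**2 claimed for B_k."""
    return scale / k ** 2
```

For `scale = 1.0`, the values `unwind(1.0, k)` stay below `claimed_bound(1.0, k)` for every `k` tried, with equality at `k = 1` and `k = 2`.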



B. Convergence rate and network scaling

We are now ready to state the Theorem on the convergence rate of D-NC.

Theorem 8 Consider algorithm D-NC under Assumptions 1-3, with the constant step size $\alpha \le 1/(2L)$, and the number of inner iterations $\tau_k$ as in (12). Let $\|\bar{x}(0) - x^\star\| \le R$, $R \ge 0$. Then, after

$$\mathcal{K} = \sum_{t=0}^{k-1}\tau_t \le \frac{3}{-\log\mu(N)}\left(k\log 3 + (k+1)\log(k+1)\right) = O(k\log k) \tag{58}$$

communication rounds, i.e., after $k$ outer iterations, we have, at any node $i$:

$$\frac{1}{N}\left(f(x_i(k)) - f^\star\right) \le \frac{1}{k^2}\left(\frac{2}{\alpha}R^2 + 11\alpha^2LG^2 + \alpha G^2\right), \quad k = 1, 2, \ldots \tag{59}$$

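Proof: As with the proof of Theorem 5, we divide the proof in two steps. In the first step, we upper bound the optimality gap $f(\bar{x}(k)) - f^\star$; in the second step, we upper bound the optimality gap at each node, $f(x_i(k)) - f^\star$.

Step 1: Upper bounding $f(\bar{x}(k)) - f^\star$. Recall that the evolution (29) with $\alpha_k = \alpha$ for $(\bar{x}(k), \bar{y}(k))$ is the Nesterov gradient with the inexact oracle $(\hat{f}_k, \hat{g}_k)$ in (30), and $(L_k = 2NL, \delta_k = L\|\tilde{y}(k)\|^2)$. Also, note that, from Lemma 7, $\delta_k \le 4\alpha^2LNG^2\frac{1}{k^4}$. Now, similarly to the proof of Theorem 5, we apply Lemma 2 with $\phi \equiv f$, $x(k) \equiv \bar{x}(k)$, $y(k) \equiv \bar{y}(k)$, $v(k) \equiv \bar{v}(k)$, and $x^\star \equiv x^\star$ to get:

$$(k+1)^2\left(f(\bar{x}(k)) - f^\star\right) + \frac{2N}{\alpha}\|\bar{v}(k) - x^\star\|^2 \le (k^2-1)\left(f(\bar{x}(k-1)) - f^\star\right) + \frac{2N}{\alpha}\|\bar{v}(k-1) - x^\star\|^2 + L\|\tilde{y}(k-1)\|^2(k+1)^2.$$

Next, using $(k+1)^2\left(f(\bar{x}(k)) - f^\star\right) \ge \left((k+1)^2-1\right)\left(f(\bar{x}(k)) - f^\star\right)$, by unwinding the above recursion, and using $\bar{v}(0) = \bar{x}(0)$:

$$\left((k+1)^2-1\right)\left(f(\bar{x}(k)) - f^\star\right) \le \frac{2N}{\alpha}\|\bar{x}(0) - x^\star\|^2 + L\sum_{t=1}^{k}\|\tilde{y}(t-1)\|^2(t+1)^2.$$

The latter equation, using $(k+1)^2 - 1 = k(k+2) \ge k^2$, and substituting $\|\tilde{y}(t-1)\| \le 2\alpha\sqrt{N}G\frac{1}{(t-1)^2}$ for $t > 1$, and $\|\tilde{y}(0)\| = 0$: $k^2\left(f(\bar{x}(k)) - f^\star\right) \le \frac{2N}{\alpha}\|\bar{x}(0) - x^\star\|^2 + 4\alpha^2LNG^2\sum_{t=2}^{k}\frac{(t+1)^2}{(t-1)^4}$, gives the optimality gap at $\bar{x}(k)$:

$$f(\bar{x}(k)) - f^\star \le \frac{1}{k^2}\left(\frac{2N}{\alpha}\|\bar{x}(0) - x^\star\|^2 + 11\alpha^2LNG^2\right), \tag{60}$$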

where we used 4 l&f = 4 YZi ^ = 4 (C(2) + 4C(3) + 4C(4)) < 11. 

Step 2: Upper bounding $f(x_i(k)) - f^\star$. This step is the same as with Theorem 5; setting $\|\tilde{x}(k)\| \le \alpha\sqrt{N}G\frac{1}{k^2}$, obtain: $f(x_i(k)) - f^\star \le f(\bar{x}(k)) - f^\star + \alpha NG^2\frac{1}{k^2}$; combining the latter with (60), we get the desired result. ∎

Network scaling. We now give the network scaling for algorithm D-NC in Theorem 9. As before with algorithm D-NG, we assume that $L$, $G$, and $R$ do not depend on $N$. Also, we assume that nodes know $L$ and $\mu(N)$ before the algorithm run. The proof of Theorem 9 is omitted here and is provided in the Appendix.

Theorem 9 Consider algorithm D-NC under Assumptions 1-3 with $\tau_k$ as in (12) and $\alpha \le 1/(2L)$. Then, after $\mathcal{K}$ communication rounds, at any node $i$, the optimality gap $\frac{1}{N}\left(f(x_i) - f^\star\right)$ is $O\left(\frac{1}{(1-\mu(N))^2}\,\frac{1}{\mathcal{K}^{2-\xi}}\right)$.



VII. Comparisons: D-NG and D-NC, and existing algorithms 

We compare D-NG, D-NC, and existing distributed (sub)gradient algorithms in [6], [14], from 
the aspects of implementation (Subsection VII- A) and convergence rate (Subsection VII-B). 

A. Comparisons: Algorithm implementation 

We discuss the (global) knowledge required for the algorithms' convergence, and the knowledge required for stopping the algorithm and tuning the step-size.

Algorithm D-NG. Interestingly, D-NG does not require any global knowledge for convergence (achieving the rate $O\left(\frac{1}{(1-\mu(N))^{3+\xi}}\,\frac{\log k}{k}\right)$), except that each node knows only its own and its neighbors' degrees $d_j$, $j \in O_i$.

We now comment on the required knowledge for nodes to a priori determine the stopping iteration $K$ such that $\frac{1}{N}\left(f(x_i(k)) - f^\star\right) \le \epsilon$, for all $i$, for all $k \ge K$, and a small $\epsilon > 0$. To determine $K$, all nodes need $C(N)$ (see Theorem 5), and so they need: a Lipschitz constant $L$; a gradient upper bound $G$; an upper bound on $\mu(N)$; and an upper bound on $R$. Clearly, the same knowledge is also needed for setting the optimal $c^\star$, the value of $c$ that minimizes $C(N)$ over $c \in (0, 1/(2L)]$; $c^\star$ can be easily numerically obtained, as $C(N)$ is convex in $c$.

Algorithm D-NC. In contrast with D-NG, D-NC needs upper bounds on $\mu(N)$ (to set $\tau_k$ in (12)) and $L$ for convergence. (See ahead Figure 1, left.) With respect to the stopping iteration and the optimal step-size $\alpha$, all nodes need upper bounds on $L$, $G$, $\mu(N)$, and $R$, the same as with D-NG. Again, $\alpha^\star$ that minimizes the constant $\frac{2R^2}{\alpha} + 11\alpha^2LG^2 + \alpha G^2$ over $(0, 1/(2L)]$ can be easily found numerically, as $\frac{2R^2}{\alpha} + 11\alpha^2LG^2 + \alpha G^2$ is convex in $\alpha$.

Finally, D-NG has a simpler structure than D-NC, and hence admits easier implementation; 
this is relevant, e.g., with the inexpensive sensor network mote processors. 

Algorithms in [6], [14]. We focus only on [14], while [6] is similar. Like D-NG, [14] does not require any global knowledge for convergence with the guaranteed rate $O\left(\frac{\log k}{k^{1/2}}\right)$.

The weight matrix $W$ with [14] does not need to satisfy (1); however, [14] still needs $\mu(N) < 1$. We are not aware if it is possible to set a symmetric $W$ with $\mu(N) < 1$ using a "less global" knowledge than what we require with D-NG (each node knowing its own and the neighbors' degrees). For stopping and optimal step-size, [14] requires all that D-NG requires, except an upper bound on $L$.

Comments on acquiring the global knowledge $\mu(N)$, $L$, $G$, $R$. We comment how nodes can acquire $L$, $G$, $R$ and $\mu(N)$ before the run of an algorithm to solve (2). Consider first $L$ (see Assumption 2). Suppose that each node knows a Lipschitz constant $L_i$ of its own cost function $f_i$. Then, $L$ can be taken as $L = \max_{i=1,\ldots,N}L_i$. Thus, each node can compute $L$ if all nodes run a distributed algorithm for maximum computation, e.g., the algorithm in equation (1) in [27]; all nodes get $L$ after $O(\mathrm{Diam}) = O(N)$ per-node communicated scalars, where $\mathrm{Diam}$ is the network diameter. Likewise, a gradient bound $G$ (see Assumption 3) can also be taken as $G = \max_{i=1,\ldots,N}G_i$, where $G_i$ is a gradient bound for $f_i$. The quantity $\mu(N)$ (equal to the second largest eigenvalue of $W$) can be computed in a distributed way, e.g., by algorithm DECENTRALOI, proposed for a more general setting in [28], and adapted to a problem like the one we consider in [[29], Subsection IV-A, p. 2519]. With DECENTRALOI, node $i$ obtains $q_i$, the $i$-th coordinate of the $N \times 1$ eigenvector $q$ of $W$ that corresponds to $\mu(N)$, (up to
e-accuracy) after O (^^^^-^^j^^y^^ per-node communicated scalars [28]; then, node i obtains 
IJ,{N) as: 



B. Comparisons: Convergence rate 

Absence of global knowledge. When no global knowledge except the neighbors' degrees is available, D-NG achieves $O\left(\frac{1}{(1-\mu(N))^{3+\xi}}\,\frac{\log k}{k}\right)$ and $O\left(\frac{1}{(1-\mu(N))^{3+\xi}}\,\frac{\log\mathcal{K}}{\mathcal{K}}\right)$. Without global knowledge of (at least upper bounds on) $\mu(N)$ and $L$, D-NC is not guaranteed to converge. The algorithm in [14] achieves $O\left(\frac{1}{1-\mu(N)}\,\frac{\log k}{k^{1/2}}\right)$ and $O\left(\frac{1}{1-\mu(N)}\,\frac{\log\mathcal{K}}{\mathcal{K}^{1/2}}\right)$. The latter suggests that, for larger $k$ and $\mathcal{K}$ (higher accuracy), D-NG is better than [14]. On the other hand, for sufficiently small $k$ and $\mathcal{K}$, and sufficiently large $1/(1-\mu(N))$, [14] may be better. (The optimality gap with [14] increases slower with $1/(1-\mu(N))$.) However, in our simulations with up to 400 nodes, D-NG had a smaller optimality gap than [14] for any value of $k$ and $\mathcal{K}$. (See also Figure 1.)



Presence of global knowledge $L$ and $\mu(N)$. The algorithm D-NG achieves $O\left(\frac{1}{(1-\mu(N))^{1+\xi}}\,\frac{\log k}{k}\right)$ and $O\left(\frac{1}{(1-\mu(N))^{1+\xi}}\,\frac{\log\mathcal{K}}{\mathcal{K}}\right)$; D-NC is $O\left(\frac{1}{k^2}\right)$ and $O\left(\frac{1}{(1-\mu(N))^{2}}\,\frac{1}{\mathcal{K}^{2-\xi}}\right)$. The algorithm in [14] is $O\left(\frac{1}{(1-\mu(N))^{1/2}}\,\frac{\log k}{k^{1/2}}\right)$ and $O\left(\frac{1}{(1-\mu(N))^{1/2}}\,\frac{\log\mathcal{K}}{\mathcal{K}^{1/2}}\right)$. Thus, in terms of $k$, D-NC is better than D-NG. In terms of $\mathcal{K}$, our big-$O$ results suggest that, for sufficiently small $\mathcal{K}$ (lower accuracy), D-NG may be better than D-NC. Likewise, [14] may be better than D-NG and D-NC for sufficiently small $\mathcal{K}$ (low accuracies). However, for sufficiently large $\mathcal{K}$ (high accuracy), D-NC eventually becomes the best. In all simulations that we performed, with networks up to 400 nodes, D-NC becomes better
than D-NG only at very high accuracies - ^^^"^7^ (/* 7^ 0) of order 10~^ — 10~^ or smaller 
(or never becomes better,) and D-NG is better than [14] for all accuracies. Finally, consider 
IC{N;e) - the number of communication rounds needed to reduce ^(/(xj) — /*) below e. 
With D-NG, )C(N; e) = O ( (iz^) ; with D-NC, /C(7V; e) = O {^^zj^hw^) ^ with [14], 
1C{N; e) = O ( (i_^(^jv))e^ ) • Hence, with all algorithms, for a fixed e, ]C{N; e) ^ a(e)l/(l-//(A^)), 
i.e., it scales linearly. While D-NC guarantees the smallest (best) slope a(e) when e — > 0, D-NG 
shows the best slope for practical accuracies in simulation (See Figure 2.) 

The $\Omega(1/k^{2/3})$ lower bound on the worst-case optimality gaps for [6]. We now focus on the dependence of the convergence rate on $k$ and $\mathcal{K}$ only (assuming a finite, fixed $1/(1-\mu(N))$). It is important to note that the optimality gap $O(\log k/k^{1/2})$ and $O(\log\mathcal{K}/\mathcal{K}^{1/2})$ in [14] holds for the non-differentiable convex $f_i$'s that are Lipschitz continuous (have bounded gradients), a wider class of functions than what we consider. To our knowledge, a detailed study of the algorithms in [6], [14] under Assumptions 2 and 3 does not exist. We thus demonstrate here that D-NG has a strictly better convergence rate in $k$ (and $\mathcal{K}$) than [6], when applied to the class of functions defined by Assumptions 2 and 3. (Thus, D-NC also has a better rate.) The method in [14] performs very similarly to [6] (or slightly worse) in our simulations.

We clarify mathematically the claim that we make. Fix a generic, connected network $\mathcal{G}$ with $N$ nodes, and fix a weight matrix $W$ that satisfies (1). Let $\mathcal{F} = \mathcal{F}(L,G)$ be the class of all convex functions $f: \mathbb{R}^d \to \mathbb{R}$ that have Lipschitz continuous derivative with constant $L$ and bounded gradient with bound $G$. Consider (2) with $f_i \in \mathcal{F}$, for all $i$; consider D-NG with the step-size $\alpha_k = \frac{c}{k+1}$, $k = 0, 1, \ldots$, $c \le 1/(2L)$. Denote by: $\mathfrak{E}^{\text{D-NG}}(k,R) = \sup_{f_1,\ldots,f_N\in\mathcal{F}}\ \sup_{\{\bar{x}(0):\ \|\bar{x}(0)-x^\star\|\le R\}}\ \max_{i=1,\ldots,N}\left\{f(x_i(k)) - f^\star\right\}$ the optimality gap at the $k$-th iteration of D-NG for the worst functions $f_i \in \mathcal{F}$, and the worst initial condition (provided $\|\bar{x}(0) - x^\star\| \le R$). From Theorem 5, for any $k = 1, 2, \ldots$, $\mathfrak{E}^{\text{D-NG}}(k,R) \le C(N)\frac{\log k}{k} = O(\log k/k)$, where $C(N)$ is given in (46). Now, consider the algorithm in [6] with the step-size $\alpha_k = \frac{c}{(k+1)^\tau}$, $k = 0, 1, \ldots$, where $c \in [0, 1/(2L)]$, $\tau \ge 0$ are the degrees of freedom for the step-size choice. (We constrain $c \le 1/(2L)$, similarly as with D-NG.) With this algorithm, $k = \mathcal{K}$. We can show that there exists a network (the $N = 2$-node connected network), a weight matrix $W$ that satisfies (1), and the values for $R$, $L$ and $G$, such that, with [6]:

$$\inf_{\tau\ge 0,\ c\in[0,1/(2L)]}\mathfrak{E}\left(k, R = \sqrt{2}; \tau, c\right) = \Omega\left(\frac{1}{k^{2/3}}\right), \tag{61}$$

where $\mathfrak{E}(k,R;\tau,c) = \sup_{f_1,\ldots,f_N\in\mathcal{F}}\ \sup_{\{\bar{x}(0):\ \|\bar{x}(0)-x^\star\|\le R\}}\ \max_{i=1,\ldots,N}\left\{f(x_i(k)) - f^\star\right\}$ is the worst-case optimality gap when the step-size $\alpha_k = \frac{c}{(k+1)^\tau}$ is used. We omit the proof due to the lack of space.

VIII. Simulation examples 

This Section provides simulation examples on the logistic (Subsections VIII-A and VIII-C), and Huber loss (Subsection VIII-B). Simulations show that D-NG outperforms the existing
algorithms in [6], [14], both in terms of the number of communication rounds and the number 
of iterations (per-node gradient evaluations). Further, D-NC is not competitive with D-NG for 
practical accuracies ({f{xi) — f*)/f* ~ 10~^ — 10~^ or coarser). Finally, by simulations in 
Subsection VIII-C, D-NG and D-NC confirm the linear scaling of the number of communication 
rounds for e-accuracy versus the inverse of the spectral gap. 

A. Distributed learning of the best linear classifier via the logistic loss 

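Optimization problem. We consider distributed learning via the logistic loss; see, e.g., [5] for further details. Nodes minimize the logistic loss: $f(x) = \sum_{i=1}^{N}f_i(x) = \sum_{i=1}^{N}\log\left(1 + e^{-b_i(a_i^\top x_1 + x_0)}\right)$, where $x = (x_1^\top, x_0)^\top$, $a_i \in \mathbb{R}^2$ is the node $i$'s feature vector, and $b_i \in \{-1,+1\}$ is its class label. The functions $f_i: \mathbb{R}^d \to \mathbb{R}$, $d = 3$, satisfy Assumptions 2 and 3. The Hessian of $f(x) = \sum_{i=1}^{N}f_i(x)$ is: $\nabla^2 f(x) = \sum_{i=1}^{N}\frac{e^{-b_ic_i^\top x}}{\left(1+e^{-b_ic_i^\top x}\right)^2}c_ic_i^\top$, where $c_i = (a_i^\top, 1)^\top \in \mathbb{R}^3$. A Lipschitz constant $L$ should satisfy $\|\nabla^2 f(x)\| \le NL$, $\forall x \in \mathbb{R}^d$. Note that $\nabla^2 f(x) \preceq \frac{1}{4}\sum_{i=1}^{N}c_ic_i^\top$, because $\frac{e^{-y}}{(1+e^{-y})^2} \le 1/4$ for all $y$. We thus choose $L = \frac{1}{4N}\left\|\sum_{i=1}^{N}c_ic_i^\top\right\| = 0.3053$, for this example.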
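
As an illustration of this choice of $L$, the sketch below computes $\frac{1}{4N}\left\|\sum_i c_ic_i^\top\right\|$ by power iteration for a given set of vectors $c_i$, and evaluates the logistic Hessian weights $\frac{e^{-y}}{(1+e^{-y})^2}$. The data passed in, the power-iteration solver, and the function names are assumptions made for illustration; the value 0.3053 above depends on the actual data, which is not reproduced here.

```python
import math


def outer_sum(cs, weights):
    """Return sum_i weights[i] * c_i c_i^T as a dense list-of-lists matrix."""
    d = len(cs[0])
    m = [[0.0] * d for _ in range(d)]
    for c, w in zip(cs, weights):
        for p in range(d):
            for q in range(d):
                m[p][q] += w * c[p] * c[q]
    return m


def top_eigenvalue(m, iters=500):
    """Largest eigenvalue of a symmetric positive semidefinite matrix (power iteration)."""
    d = len(m)
    v = [1.0] * d
    lam = 0.0
    for _ in range(iters):
        w = [sum(m[p][q] * v[q] for q in range(d)) for p in range(d)]
        norm = math.sqrt(sum(z * z for z in w))
        if norm == 0.0:
            return 0.0
        v = [z / norm for z in w]
        # Rayleigh quotient of the normalized iterate.
        lam = sum(v[p] * sum(m[p][q] * v[q] for q in range(d)) for p in range(d))
    return lam


def lipschitz_constant(cs):
    """L = || sum_i c_i c_i^T || / (4 N)."""
    return top_eigenvalue(outer_sum(cs, [1.0] * len(cs))) / (4.0 * len(cs))


def hessian_weights(cs, bs, x):
    """Weights e^{-y} / (1 + e^{-y})^2 at y = b_i c_i^T x, computed without overflow."""
    out = []
    for c, b in zip(cs, bs):
        y = b * sum(ci * xi for ci, xi in zip(c, x))
        s = 1.0 / (1.0 + math.exp(-abs(y)))  # the weight is symmetric in y
        out.append(s * (1.0 - s))
    return out
```

Since every weight is at most 1/4, the largest eigenvalue of `outer_sum(cs, hessian_weights(cs, bs, x))` should not exceed `len(cs) * lipschitz_constant(cs)` for any `x`.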



None of the $f_i$'s is strongly convex on $\mathbb{R}^d$; $f$ is not strongly convex on $\mathbb{R}^d$ either; there exists a sequence $x_n$ with $\lim_{n\to\infty}\|x_n\| = \infty$, such that $\lim_{n\to\infty}\|\nabla^2 f(x_n)\| = 0$. However, for the numerical values that we considered, $f$ was strongly convex in a neighborhood of the solution $x^\star$.

Data. We generate the $a_i$'s independently over $i$; each entry is drawn from the standard normal distribution. We generate the "true" vector $x^\star = (x_1^{\star\top}, x_0^\star)^\top$ by drawing its entries independently from the standard normal distribution. The class labels are generated as $b_i = \mathrm{sign}\left(x_1^{\star\top}a_i + x_0^\star + \epsilon_i\right)$, where the $\epsilon_i$'s are drawn independently from a normal distribution with zero mean and variance 3.

Network. The network is a geometric network: nodes are placed uniformly randomly on a unit square and the nodes whose distance is less than a radius are connected by an edge. There are $N = 100$ nodes, and the relative degree $\frac{\text{number of links}}{N(N-1)/2} = 10\%$.

Algorithm parameters and metrics. With D-NG, we set the $W_{ij}$'s based on the neighbors' degrees as mentioned in Section II. With D-NC, [6], and [14], we use the Metropolis weights (that require the same knowledge to be set). The step-size $\alpha_k$ is: $\alpha_k = 1/(k+1)$, with D-NG; $\alpha = 1/(2L)$, with D-NC; and $1/(k+1)^p$, with [6] and [14], with $p \in \{1/3, 1/2, 1\}$. Note that we do not optimize the constant in the step-sizes with [6], [14] nor with D-NG; they operate without the prior knowledge of $L$, $\mu(N)$. We simulate the relative average error across nodes $\frac{1}{Nf^\star}\sum_{i=1}^{N}\left(f(x_i) - f^\star\right)$, $f^\star \ne 0$, versus the total number of communications at all nodes ($= N\mathcal{K}$).

Results: D-NG versus D-NC. Figure 1 (left) compares D-NG (solid, red curve) versus D-NC in terms of the number of communication rounds $\mathcal{K}$. The solid, blue curve shows the performance of D-NC for $\tau_k$ in (12), the value that guarantees the convergence result in Theorem 8, and $\alpha = 1/(2L)$, the largest step-size for which we can guarantee convergence on the entire class of the $f_i$'s that obey Assumptions 2 and 3. D-NG performs better than D-NC for the
accuracies smaller than about 10~^; at higher accuracies, D-NC becomes better than D-NG. 
Further, Figure 1 (left) shows D-NC when we decrease $\tau_k$ (see the legend in Figure 1, left). For sufficiently small values of $\tau_k$, D-NC converges only to a solution neighborhood. Note that, to appropriately set $\tau_k$, nodes need to know an upper bound on $\mu(N)$, and so this global knowledge is critical for the convergence of D-NC. Finally, Figure 1 (left) shows the behavior of D-NC when we increase the step-size $\alpha$. For a sufficiently large $\alpha$, D-NC fails to converge; again, the knowledge of (an upper bound on) $L$ is critical.

Results: D-NG and D-NC versus [6], [14]. Figure 1 (center) compares our algorithms D- 
NG and D-NC with [6], [14]. First, we can see that D-NG outperforms [6], [14]. For example, 
for the precision 10~^, D-NG reduces the number of communication rounds with respect to [6], 
[14] about 40 times - from about 12, 800 transmissions to about 423, 000 transmissions for [6] 
with ak = 1/(A; + 1)^ and p = 1/2. Note also that D-NG achieves a faster rate than [6], [14]; D- 
NG achieves about 1//C^, while [6], [14] with p— 1/2 achieve about 1//C. Finally, with both [6], 
[14], the step size choice p— 1/2 performs better than p = 1 or p = 1/3. D-NC performs better 
than [6], [14] for the accuracies 10^ or finer, and worse for the lower accuracies.

B. Huber loss

We also consider the Huber loss cost functions, which find applications in, e.g., distributed estimation in sensor networks [1]. Given node $i$'s private, scalar value (sensor's measurement) $\theta_i$, the function $f_i: \mathbb{R} \to \mathbb{R}$ is the Huber loss: $f_i(x) = \frac{1}{2}\|x - \theta_i\|^2$ if $\|x - \theta_i\| \le 1$, and $f_i(x) = \|x - \theta_i\| - 1/2$, else. The $f_i$'s obey Assumptions 2 and 3, and no $f_i$ is strongly convex, nor is $f(x) = \sum_{i=1}^{N}f_i(x)$.

The setup is the same as in the previous subsection, unless specified here otherwise. The 
network is a geometric graph with = 101 nodes and the relative degree ^ 10%. We set 
9i = lOi, i = 1,...,N = 101; the Lipschitz constant L = 1; the algorithms in [14], [6] use 
p = 1/2; with D-NC, we set as in (12) and a = 1/(2L). Figure 1 (right) compares D-NC, 
D-NG, [6], and [14]. We can see that D-NG significantly outperforms the other algorithms. 
Also, it is worth noting that here D-NG and D-NC do not intersect at a high accuracy like 
with the logistic loss example; rather, D-NG persists in being better. Note that this does not 
contradict the theory, as both D-NG and D-NC achieve (from Figure 1, right) about 0(1//C^). 




[Figure 1: three panels; horizontal axis of each panel: number of communications (all nodes), log scale.]

Fig. 1. Average relative error $\frac{1}{Nf^\star}\sum_{i=1}^{N}\left(f(x_i) - f^\star\right)$ versus the number of communications (all nodes) $N\mathcal{K}$. Left and center: Logistic loss; Right: Huber loss. In the left Figure, $\gamma = 1 - \mu(N)$ is the spectral gap.

C. Logistic loss: Network scaling 

We simulate the dependence of the number of communications $\mathcal{K}(\mu;\epsilon)$ required for the $\epsilon$-accuracy versus the inverse of the spectral gap $1/(1-\mu)$, in the presence of global knowledge before the algorithm run. Ignoring $\xi$-small and logarithmic factors, with both D-NG and D-NC, but also with distributed dual averaging in [14], theory predicts (at worst) linear asymptotic dependence $O(1/(1-\mu))$ when $\mu \to 1^-$. We consider a geometric graph with $N = 7$ nodes. We set the weight matrix as $W = I - w\mathcal{L}$, where $\mathcal{L}$ is the unweighted Laplacian matrix, and $w > 0$ is the weight. We emulate the network scaling (deterioration of the spectral gap) by decreasing the weights. We let $w \in \{0.0714; 0.0238; 0.0143; 0.0102; 0.0079\}$. Note that $\mu = 1 - w\lambda_2(\mathcal{L}) = 1 - 3w$, in this example. The step-sizes are: 1) D-NG: $\alpha_k = w/(k+1)$ (see Theorem 6); 2) dual averaging: $\alpha_k = w^{1/2}/(k+1)^{1/2}$ (see Theorem 2 in [14]); 3) D-NC: $\alpha = 1/(2L)$.
We find /C(/i; e) - the smallest number of communication rounds at which the average relative 
error J2iLiifi^i) " f*) f^^^^ below e. Figure 2 plots /C(/i; e) versus 1/(1 — yu) for D-NG, 
D-NC, and distributed dual averaging; Figure 2, left is for j^Yl!i=i{f{^'i) ~ /*) = 10^^'^, 
while Figure 2, center is for ]v7^ ~ /*) — 10^^. First, we can see that, with all 

three algorithms, the theoretical linear dependence is confirmed well in simulation. Second, 
D-NG is the best for both higher and lower precision. Lastly, the gain of D-NG over D-NC 
decreases with the increase of precision, as predicted by theory. (See Figure 2, right.) Namely, 
p(/i) := /C(yu; e; D — NC)//C(yU; e; D — NG) is much smaller (p(/u) equals 2 — 3) for the higher 
accuracy e = 10"^ than for the lower accuracy e = lO^^ '' (p(/i) equals 6 — 6.5.) Theory predicts 
that, when e — )■ 0, D-NC eventually becomes better; this occurs only at very small e. 





Fig. 2. Network scaling: Number of communication rounds /C(/i; e) needed to reduce the average relative er- 
ror -ffp^'^^iifixi) — /*) below e versus the inverse spectral gap 1/(1 — /i). Left: Low accuracy: e — 10~^ *; Center: 
High accuracy: e = 10"^ Right: Ratio p(^i) = K.{^l\ e = 10-^)/K;(m; e = 10"^ ''). 

IX. Conclusion 

We studied distributed optimization in networks where nodes minimize the sum of their individual cost functions $\sum_{i=1}^{N}f_i(x)$ subject to a global variable $x \in \mathbb{R}^d$. Existing work has proposed distributed gradient based algorithms to solve the above problem and studied their convergence rates, under a wide class of convex, non-differentiable $f_i$'s, with bounded gradients for unconstrained problems (Lipschitz $f_i$'s over the constraint set for constrained problems). In this paper, we asked whether faster convergence rates (than the rates established in the literature) can be achieved on a more structured class of $f_i$'s: convex, with Lipschitz continuous gradient (with constant $L$) and bounded gradient (with constant $G$). Building from the centralized Nesterov gradient, we answer this question affirmatively by proposing two distributed gradient algorithms. Our algorithm D-NG achieves the rate $O\left(\frac{1}{(1-\mu(N))^{3+\xi}}\,\frac{\log k}{k}\right)$ and $O\left(\frac{1}{(1-\mu(N))^{3+\xi}}\,\frac{\log\mathcal{K}}{\mathcal{K}}\right)$, when the knowledge of global parameters $L$ and $\mu(N)$ is not available before the algorithm run. The rate, for the optimized step size, improves to $O\left(\frac{1}{(1-\mu(N))^{1+\xi}}\,\frac{\log k}{k}\right)$ and $O\left(\frac{1}{(1-\mu(N))^{1+\xi}}\,\frac{\log\mathcal{K}}{\mathcal{K}}\right)$, when $L$ and $\mu(N)$ are available before the run. Our algorithm D-NC operates only if $L$ and $\mu(N)$ are available and achieves the rate $O\left(\frac{1}{(1-\mu(N))^{2}}\,\frac{1}{\mathcal{K}^{2-\xi}}\right)$ and $O\left(\frac{1}{k^2}\right)$. We also showed that (under a fixed $N$) our methods achieve strictly faster rates than the method in [6], and that, under a fixed accuracy $\epsilon$, the communication cost with our methods scales linearly with the inverse of the spectral gap $1/(1-\mu(N))$. Finally, simulations with the logistic and Huber losses show that our D-NG algorithm outperforms the existing methods.

Appendix A

Proof that condition (1) holds for the weight choice example in Section II 

Consider the weight choice: Wij = 1/(1 + 3max{(ii, rfj}) for {i.j} G E; Wij = for i 7^ j 
and {i,j} ^ E; and Wu = 1 — J^j^^i^ij- prove that for such a weight choice, under 
Assumption 1 of the connected underlying network, property (1) holds. 

We first prove the second condition in (1). Consider the associated weighted Laplacian
$\mathcal{L}_w := I - W$, and the associated unweighted Laplacian $\mathcal{L}$ with
$\mathcal{L}_{ij} = -1$ if, and only if, $i \neq j$, $W_{ij} > 0$; $\mathcal{L}_{ij} = 0$ if, and
only if, $i \neq j$, $W_{ij} = 0$; and $\mathcal{L}_{ii} = -\sum_{j \neq i} \mathcal{L}_{ij}$.
Then, to prove the second condition in (1), we have, by the inequality of the spectral and the
max-row-sum norms, and because $|[\mathcal{L}_w]_{ii}| = \sum_{j \neq i} |[\mathcal{L}_w]_{ij}|$:

$$\|\mathcal{L}_w\| \le \|\mathcal{L}_w\|_\infty = \max_{i=1,...,N} \sum_{j=1}^{N} |[\mathcal{L}_w]_{ij}| \le 2 \max_{i=1,...,N} \frac{d_i}{1+3d_i} < 2/3.$$

Thus, $\lambda_1(W) = 1 - \lambda_N(\mathcal{L}_w) = 1 - \|\mathcal{L}_w\| > 1 - 2/3 = 1/3 > 0$,
and the second condition in (1) holds.

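We now prove the first condition in (1). Recall that $\mu(N) = \|W - J\| = \|I - J - \mathcal{L}_w\|$,
and so $\mu(N) = \max\{|1-\lambda_2(\mathcal{L}_w)|, |1-\lambda_N(\mathcal{L}_w)|\}$, which, because
$0 \le \lambda_2(\mathcal{L}_w) \le \lambda_N(\mathcal{L}_w) = \|\mathcal{L}_w\| < 1$,
equals $\mu(N) = 1 - \lambda_2(\mathcal{L}_w)$. Further:

$$\lambda_2(\mathcal{L}_w) = \min_{\{\|p\|=1,\, p^\top 1 = 0\}} p^\top \mathcal{L}_w p = \min_{\{\|p\|=1,\, p^\top 1 = 0\}} \sum_{\{i,j\} \in E} W_{ij}(p_i - p_j)^2 \ge \frac{1}{1+3d_{\max}} \min_{\{\|p\|=1,\, p^\top 1 = 0\}} \sum_{\{i,j\} \in E} (p_i - p_j)^2 = \frac{\lambda_2(\mathcal{L})}{1+3d_{\max}} > 0,$$

because the second smallest eigenvalue of the unweighted Laplacian $\mathcal{L}$ on a connected
network is positive. Thus, $\mu(N) \le 1 - \frac{\lambda_2(\mathcal{L})}{1+3d_{\max}} < 1$.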
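
As a quick numerical sanity check of the row-sum bound used in this proof, the following Python sketch builds $W$ for a small example graph and verifies that each absolute row sum of $I - W$ is at most $2d_i/(1+3d_i) < 2/3$. The graph (a 5-node ring with one chord) and the function names are illustrative assumptions, not taken from the paper.

```python
def build_weights(n, edges):
    """W_ij = 1/(1 + 3*max(d_i, d_j)) on edges, W_ii = 1 - sum_{j != i} W_ij."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        w = 1.0 / (1.0 + 3.0 * max(deg[i], deg[j]))
        W[i][j] = W[j][i] = w
    for i in range(n):
        W[i][i] = 1.0 - sum(W[i][j] for j in range(n) if j != i)
    return W, deg


def laplacian_row_sums(W):
    """Absolute row sums of L_w = I - W (the terms of the max-row-sum norm)."""
    n = len(W)
    return [abs(1.0 - W[i][i]) + sum(abs(W[i][j]) for j in range(n) if j != i)
            for i in range(n)]


# Hypothetical example graph: ring on 5 nodes plus the chord {0, 2}.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
W, deg = build_weights(5, edges)
row_sums = laplacian_row_sums(W)
bound = max(2.0 * d / (1.0 + 3.0 * d) for d in deg)
# Since ||L_w|| <= max row sum, the smallest eigenvalue of W is at least this value.
lambda_min_lower_bound = 1.0 - max(row_sums)
```

The check only exercises the norm inequality on one graph; it does not replace the proof, which holds for any connected network.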

Appendix B 

Proof of the result in Remark below Theorem 5 

We let Assumptions 1-3 hold, and we estimate the optimality gap $\frac{1}{N}(f(x_i(k)) - f^\star)$
at any node $i$ when condition $c \le 1/(2L)$ does not hold. Suppose $c > 1/(2L)$. Denote by
$k' = 2cL$. We will show that:

2 



+ 16c^LC4,,G^ + cCeonsG^ (62) 
Progress equation (51) still holds if L'^ = ^^^M > 2ArL, i.e., if A; > A;' := 2cL. Telescop- 



ing (51) backwards from k > k' until k', and using '■^t.^. > k 



^ {f{x{k)) - n 



2 



< 



t=fe' 



- 1)) - n + -imk' - i)r + 2\\x^') + -Y^ \m - 



2 



(63) 

Lemma 4 holds unchanged if $c > 1/(2L)$, and so (compare also to (49)):

11l|2ii±l)!<16c%^2 ^2^^ (^ + 2)^ ..4, 
^ 2. \W - 1)11 < ^^^^consG 2. (64) 

t=l t=1 ^ ' 

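We now upper bound $\|\overline{v}(k'-1)\|$, where we recall
$\overline{v}(k) = \frac{1}{\theta_k}\overline{y}(k) - \frac{1-\theta_k}{\theta_k}\overline{x}(k)$,
and $\theta_k = \frac{2}{k+2}$. Then:

$$\|\overline{v}(k-1)\| \le (k+1)\|\overline{y}(k-1)\| + k\|\overline{x}(k-1)\| \le (2k+1)M_{k-1}, \quad (65)$$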

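where $M_{k-1} := \max\{\|\overline{y}(k-1)\|, \|\overline{x}(k-1)\|\}$. By (31), using
$\alpha_{k-1} = c/k$, $\beta_{k-1} \le 1$, and $\|\sum_{i=1}^N \nabla f_i(y_i(k-1))\| \le NG$:

$$\|\overline{x}(k)\| \le \|\overline{y}(k-1)\| + \frac{cG}{k} \le M_{k-1} + \frac{cG}{k}$$

$$\|\overline{y}(k)\| \le 2\|\overline{x}(k)\| + \|\overline{x}(k-1)\| \le 2\|\overline{y}(k-1)\| + \|\overline{x}(k-1)\| + \frac{2cG}{k} \le 3M_{k-1} + \frac{2cG}{k},$$

and so:

$$M_k \le 3M_{k-1} + \frac{2cG}{k}, \quad k = 1,2,..., \quad M_0 = 0.$$

By unwinding the above recursion from $k = k'-1$ to $k = 0$, we get:

$$M_{k'-1} \le \left(\frac{3^{k'}-1}{2}\right) 2cG. \quad (66)$$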
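
The unwinding step can be illustrated numerically. The sketch below iterates the recursion $M_k = 3M_{k-1} + 2cG/k$ with equality (the worst case) and compares it with the geometric-sum bound $\frac{3^k-1}{2}\,2cG$, which follows from $1/k \le 1$. The function names and the sample values of $c$ and $G$ are assumptions made for illustration only.

```python
def worst_case_M(k, c, G):
    """Iterate M_t = 3*M_{t-1} + 2*c*G/t with M_0 = 0 (recursion taken with equality)."""
    M = 0.0
    for t in range(1, k + 1):
        M = 3.0 * M + 2.0 * c * G / t
    return M


def closed_form_bound(k, c, G):
    """Geometric-sum bound ((3**k - 1)/2) * 2*c*G, obtained by using 1/t <= 1."""
    return (3.0 ** k - 1.0) / 2.0 * 2.0 * c * G


# Hypothetical sample values; any c > 0 and G > 0 behave the same way.
pairs = [(worst_case_M(k, 0.7, 10.0), closed_form_bound(k, 0.7, 10.0)) for k in range(12)]
```

Because the bound is increasing in $k$, it also covers $M_{k'-1}$ when evaluated at $k'$, which is the form used in (66).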
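Further, combining the last equation with (65), and squaring the resulting inequality:

$$\|\overline{v}(k'-1)\|^2 \le (2k'+1)^2 \left(\frac{3^{k'}-1}{2}\right)^2 4c^2G^2. \quad (67)$$

Using the Lipschitz continuity of $\nabla f$ (with constant $LN$), and using $\nabla f(x^\star) = 0$:

$$f(\overline{x}(k'-1)) - f^\star \le \nabla f(x^\star)^\top (\overline{x}(k'-1) - x^\star) + \frac{LN}{2}\|\overline{x}(k'-1) - x^\star\|^2$$

$$\le (LN)\left(\|\overline{x}(k'-1)\|^2 + \|x^\star\|^2\right)$$

$$\le (LN)\left(\left(\frac{3^{k'}-1}{2}\right)^2 4c^2G^2 + \|x^\star\|^2\right),$$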

where we used (66) to upper bound Mfc/_i. Finally, combining the last equation with (63), (64), (67), 
using Ylt=i ^^75^ — ^' ^^'^ repeating the same argument as in (53): 

^(/(^.(fc))-r)<c'(iv)|j^^, k>k', 

where, recalling < i? as x(0) = y{0) = 0, we obtain (62): Compare C{N) in (46) with 
C'{N) in (62): C'{N) > C(iV): the first (nonnegative) summand ^^^L (^{^^)'^ Ac'^G^ + i?^' 
in (62) does not appear in (46); also, the second summand f (2{2k' + if {^^^ Ac^C^ + 2R'^ 
is larger than the corresponding summand ^ in (46). 

Appendix C 
Proof of Theorem 6 

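Consider the sequence of the weight matrices $W^{(1)}, ..., W^{(N)}, ...$, where $W^{(N)}$ has the
dimension $N \times N$. Write simply $W = W^{(N)}$ and $\widetilde{W} = W - J$, and recall that
$\mu(N) = \|\widetilde{W}\|$. Suppose that $\lambda_2(\widetilde{W})$ is bounded away from zero,
i.e., there exist a sufficiently large $N_0$ and $\lambda \in (0,1)$, such that
$\lambda_2(\widetilde{W}) > \lambda > 0$, for all $N \ge N_0$.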

We first show the scaling result for $C_{\mathrm{cons}} = C_{\mathrm{cons}}(N)$ in (32). Then, we
apply the scaling result for $C_{\mathrm{cons}}$ to the constant $C'(N)$ in (62) to prove part (a)
of the Theorem: $C'(N) = O\left(\frac{1}{(1-\mu(N))^{3+\xi}}\right)$ as $N \to \infty$; finally, we
use the scaling for $C_{\mathrm{cons}}$ and the constant $C(N)$ in Theorem 5 (equation (46)) to
prove part (b) of the Theorem: $C(N) = O\left(\frac{1}{(1-\mu(N))^{1+\xi}}\right)$ as $N \to \infty$.

Scaling of $C_{\mathrm{cons}}$ in (32). We will show:

$$C_{\mathrm{cons}} = O\left(\frac{1}{(1-\mu(N))^{3/2+\xi}}\right) \text{ when } N \to \infty, \quad (68)$$

for arbitrarily small $\xi > 0$. We separately upper bound each of the terms in $C_{\mathrm{cons}}$:

1/min {^A2 W(l - X2{W)), ^^^{N){1 - 2B(v/M^), '^'T'\-io,\iN)y j-^. 

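First consider: $1/\min\left\{\sqrt{\lambda_2(\widetilde{W})(1-\lambda_2(\widetilde{W}))}, \sqrt{\mu(N)(1-\mu(N))}\right\}$.
We have $\lambda_2(\widetilde{W})(1-\lambda_2(\widetilde{W})) \ge \lambda(1-\mu(N))$, $N \ge N_0$;
likewise, $\mu(N)(1-\mu(N)) \ge \lambda(1-\mu(N))$, $N \ge N_0$. Thus:

$$1/\min\left\{\sqrt{\lambda_2(\widetilde{W})(1-\lambda_2(\widetilde{W}))}, \sqrt{\mu(N)(1-\mu(N))}\right\} \le \frac{1}{\sqrt{\lambda}\sqrt{1-\mu(N)}}. \quad (69)$$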

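Further, recall the definition of the function $\mathcal{B}: (0,1) \to \mathbb{R}$:

$$\mathcal{B}(r) = \sup_{z \ge 1/2} \left\{ z r^z \log(1+z) \right\}.$$

For any $\xi > 0$, there exists a constant $K_B(\xi) \in (0,\infty)$ such that:
$\log(1+z) \le K_B(\xi) z^\xi$, $\forall z \ge 1/2$. Thus:

$$\mathcal{B}(r) = \sup_{z \ge 1/2} \left\{ z r^z \log(1+z) \right\} \le K_B(\xi) \sup_{z \ge 1/2} \left\{ z^{1+\xi} r^z \right\} \le K_B(\xi)\, e^{-(1+\xi)} (1+\xi)^{1+\xi} \frac{1}{(-\log r)^{1+\xi}}. \quad (70)$$

Note from the above that $\mathcal{B}(r) < \infty$, for all $r \in (0,1)$.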
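
To make the finiteness of $\mathcal{B}(r)$ concrete, the sketch below approximates the supremum on a grid and compares it with the closed-form value of $\sup_{z>0} z^{1+\xi} r^z$, which is attained at $z = (1+\xi)/(-\log r)$. It uses the special case $\xi = 1$, for which $K_B(1) = 1$ is admissible because $\log(1+z) \le z$; this choice, the grid parameters, and the function names are assumptions made for illustration.

```python
import math


def B_grid(r, z_max=500.0, steps=20000):
    """Grid approximation (from below) of sup_{z >= 1/2} z * r**z * log(1 + z)."""
    best = 0.0
    for n in range(steps + 1):
        z = 0.5 + (z_max - 0.5) * n / steps
        best = max(best, z * r ** z * math.log1p(z))
    return best


def sup_power(r, xi):
    """Closed form of sup_{z > 0} z**(1 + xi) * r**z = ((1 + xi) / (e * (-log r)))**(1 + xi)."""
    a = 1.0 + xi
    return (a / (math.e * -math.log(r))) ** a


# With xi = 1 and K_B = 1, the bound reads B(r) <= sup_power(r, 1.0).
checks = [(r, B_grid(r), sup_power(r, 1.0)) for r in (0.3, 0.6, 0.9)]
```

Both quantities grow as $r \to 1$, which is the regime $\mu(N) \to 1$ relevant to the scaling argument.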

Now, consider the term: 2(2*^^+^) 1 Because the function — — < for all z e 

' e (— log/i(W)) —log 2 — 1— 2' 

(0, 1), we have that: 

2(2Cfe + 1) 1 ^ 2(2Cfe + 1) 1 

e (-log MAT)) - e l-M^')' 

Consider next the term $\frac{2}{1-\sqrt{\mu(N)}}$. Because $\frac{1}{1-\sqrt{z}} \le \frac{2}{1-z}$,
for all $z \in (0,1)$, we have:

$$\frac{2}{1-\sqrt{\mu(N)}} \le \frac{4}{1-\mu(N)}. \quad (72)$$

Now, consider the term $2\mathcal{B}\left(\sqrt{\mu(N)}\right)$. From (70), and because
$1/(-\log\sqrt{z}) \le 2/(1-z)$, $z \in (0,1)$:

$$2\mathcal{B}\left(\sqrt{\mu(N)}\right) \le 2K_B(\xi)\, e^{-(1+\xi)} (1+\xi)^{1+\xi} \frac{2^{1+\xi}}{(1-\mu(N))^{1+\xi}}. \quad (73)$$

Combining (69), (71), (72), and (73), we obtain that
$C_{\mathrm{cons}}(N) \le M_{\mathrm{cons}}(\xi) \frac{1}{(1-\mu(N))^{3/2+\xi}}$ for all
$N \ge N_0$, where the constant $M_{\mathrm{cons}}(\xi)$ is independent of $N$ and of the matrices
$W^{(N)}$, $N \ge 1$. Thus, the result in (68).
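
The scaling argument relies on three elementary inequalities on $(0,1)$. The following sketch checks them on a uniform grid; it is only a numerical illustration (each inequality follows from $-\log z \ge 1 - z$ and $1 + \sqrt{z} \le 2$), and the helper name and the grid are hypothetical.

```python
import math


def inequality_pairs(z):
    """Return (lhs, rhs) pairs for the three inequalities at a point z in (0, 1)."""
    return [
        (1.0 / -math.log(z), 1.0 / (1.0 - z)),             # 1/(-log z)      <= 1/(1 - z)
        (1.0 / (1.0 - math.sqrt(z)), 2.0 / (1.0 - z)),     # 1/(1 - sqrt z)  <= 2/(1 - z)
        (1.0 / -math.log(math.sqrt(z)), 2.0 / (1.0 - z)),  # 1/(-log sqrt z) <= 2/(1 - z)
    ]


grid = [i / 1000.0 for i in range(1, 1000)]
# Small relative tolerance to absorb floating-point rounding near z = 1.
all_hold = all(lhs <= rhs * (1.0 + 1e-12) for z in grid for lhs, rhs in inequality_pairs(z))
```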

Proof of part (a) of Theorem 6. Recall that we assumed that $L$, $G$, $R$, and $c$ do not depend
on $N$, and also $k' = 2Lc$ does not depend on $N$. In (62), only $C_{\mathrm{cons}}$, given by
(32), depends on $N$. Consider $C'(N)$ in (62). We can see that
$C'(N) = O(C_{\mathrm{cons}}^2) = O\left(1/(1-\mu(N))^{3+2\xi}\right) = O\left(1/(1-\mu(N))^{3+\xi'}\right)$,
where $\xi' = 2\xi > 0$ is arbitrarily small.

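Proof of part (b) of Theorem 6. Plugging $c = \Theta(1-\mu(N))$ in $C(N)$ in (46), we have

$$C(N) = O\left(\frac{1}{1-\mu(N)} + \left(\frac{1}{1-\mu(N)}\right)^{3+2\xi}(1-\mu(N))^2 + \left(\frac{1}{1-\mu(N)}\right)^{3/2+\xi}(1-\mu(N))\right) = O\left(1/(1-\mu(N))^{1+\xi'}\right),$$

where $\xi' = 2\xi > 0$ is arbitrarily small. This completes the proof of Theorem 6.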



Appendix D 
Proof of Theorem 9 

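We will prove that, after $\mathcal{K}$ communication rounds,
$\frac{1}{N}(f(x_i) - f^\star) \le C(N)\frac{1}{\mathcal{K}^{2-\xi}}$, where
$C(N) = O\left(\frac{1}{(1-\mu(N))^2}\right)$ when $N \to \infty$. We first make the dependence of
the optimality gap on $\mathcal{K}$ more explicit. From (58), for arbitrarily small $\xi > 0$,
there exists a constant $C_0(\xi) \in (1,\infty)$ that depends only on $\xi$ such that:
$3k\log 3 + 3(k+1)\log(k+1) \le C_0(\xi)k^{1+\xi}$, $\forall k \ge 1$. From the last equation, for
$\mathcal{K}$ in (58): $\mathcal{K} \le C_0(\xi)\frac{k^{1+\xi}}{-\log\mu(N)} \le C_0(\xi)\frac{k^{1+\xi}}{1-\mu(N)}$,
where the last inequality uses $1/(-\log z) \le 1/(1-z)$, $z \in (0,1)$. Raising to power
$1/(1+\xi)$ and using $C_0(\xi) \ge 1$, $1/(1+\xi) \ge 1-\xi$, $\xi \in (0,1)$, gives:

$$\mathcal{K}^{1-\xi} \le \mathcal{K}^{1/(1+\xi)} \le C_0(\xi)\frac{1}{1-\mu(N)}k.$$

From the last equation:

$$\frac{1}{k^2} \le \left(C_0(\xi'/2)\right)^2 \frac{1}{(1-\mu(N))^2} \frac{1}{\mathcal{K}^{2-\xi'}},$$

where we introduced $\xi' = 2\xi$. Applying the last equation to Theorem 9, we obtain that, after
$\mathcal{K}$ communication rounds, $\frac{1}{N}(f(x_i) - f^\star)$ is, for any node $i$, upper
bounded by a constant multiple of $\frac{1}{(1-\mu(N))^2}\frac{1}{\mathcal{K}^{2-\xi'}}$, for
arbitrarily small $\xi' > 0$. Thus, after $\mathcal{K}$ communication rounds,
$\frac{1}{N}(f(x_i) - f^\star) \le C(N)\frac{1}{\mathcal{K}^{2-\xi'}}$, where
$C(N) = O\left(\frac{1}{(1-\mu(N))^2}\right)$ when $N \to \infty$. Thus, the result in Theorem 9.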

Appendix E 

Proof of the $\Omega(1/k^{2/3})$ lower bound on the worst-case optimality gap for [6]

We prove (61) by constructing a hard example of the functions $f_i$ and a hard initial
condition $x(0)$.

Network and functions $f_i$. Consider $N = 2$ nodes, and set the $2 \times 2$ weight matrix $W$ as
$W_{11} = W_{22} = 1 - w$, and $W_{12} = W_{21} = w$, and $w = 1/8$. The eigenvalue decomposition
is $W = Q\Lambda Q^\top$, with $Q = [q_1, q_2]$, the columns $q_1 = \frac{1}{\sqrt{2}}(-1,1)^\top$,
$q_2 = \frac{1}{\sqrt{2}}(1,1)^\top$, and $\Lambda$ diagonal with the eigenvalues
$\Lambda_{11} = \lambda_1 = 1 - 2w = 3/4$, $\Lambda_{22} = \lambda_2 = 1$.

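We set $d = 2$. We denote the coordinates of the 2-dimensional variable $x$ by
$x = (x^{(1)}, x^{(2)})^\top$; for example, the estimate at iteration $k$ of node $i$ is then
denoted by $x_i(k) = (x_i^{(1)}(k), x_i^{(2)}(k))^\top$. The nodes solve problem (2), where the
functions $f_i^\eta: \mathbb{R}^2 \to \mathbb{R}$ are, for $i = 1,2$:

$$f_i^\eta(x) = \begin{cases} \frac{1}{2}\left(\eta(x^{(1)} + (-1)^i)^2 + (x^{(2)} + (-1)^i)^2\right) & \text{if } \eta(x^{(1)} + (-1)^i)^2 + (x^{(2)} + (-1)^i)^2 \le \chi^2 \\ \chi\left(\left[\eta(x^{(1)} + (-1)^i)^2 + (x^{(2)} + (-1)^i)^2\right]^{1/2} - \frac{\chi}{2}\right) & \text{else.} \end{cases} \quad (75)$$

Here $\chi > 0$ is a constant that we set to $\chi = 6$; $\eta$ is a constant in $[0,1]$, for which
we subsequently design a particular value that makes problem (2) hard. The function $f_i^\eta$ is
similar to the Huber loss; it is quadratic in the region

$$\mathcal{R}_i = \left\{x \in \mathbb{R}^2: \eta(x^{(1)} + (-1)^i)^2 + (x^{(2)} + (-1)^i)^2 \le \chi^2\right\}, \quad (76)$$

and outside of this region it behaves as a norm. The solution to (2), with
$f(y) = f_1^\eta(y) + f_2^\eta(y)$, is $x^\star = (0,0)^\top$, and the corresponding optimal value
is $f^\star = \eta + 1$.

Properties of the $f_i^\eta$'s. We now show that the $f_i^\eta$'s are convex, have Lipschitz
continuous gradient with constant $L = \sqrt{2}$, and bounded gradients
$\|\nabla f_i^\eta(x)\| \le 10$, for all $x$, $i = 1,2$. Thus, the $f_i^\eta$'s in (75) belong to
the class $\mathcal{F} = \mathcal{F}(L = \sqrt{2}, G = 10)$, for any $\eta \in [0,1]$.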
To show that the function $x \mapsto f_1^\eta(x)$ is convex, note that it can be represented as
the following concatenation:
$x \mapsto y = (\sqrt{\eta}(x^{(1)} - 1), (x^{(2)} - 1))^\top \mapsto z = \|y\| \mapsto w = f_h(z) = f_1^\eta(x)$,
where $f_h: \mathbb{R}_+ \to \mathbb{R}$ is the Huber loss: $f_h(z) = \frac{1}{2}z^2$, if
$|z| \le \chi$, and $f_h(z) = \chi(|z| - \chi/2)$, else. Hence, $x \mapsto f_1^\eta(x)$ is a
concatenation of an affine function, a convex function, and a convex non-decreasing function,
and hence it is convex. Analogously, we can show that $x \mapsto f_2^\eta(x)$ is convex.

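We now show the Lipschitz continuity and the boundedness of the gradient of $f_1^\eta$. The
gradient equals:

$$\nabla f_1^\eta(x) = \begin{cases} \left(\eta(x^{(1)} - 1), (x^{(2)} - 1)\right)^\top & \text{if } x \in \mathcal{R}_1 \\ \frac{\chi}{\left[\eta(x^{(1)} - 1)^2 + (x^{(2)} - 1)^2\right]^{1/2}} \left(\eta(x^{(1)} - 1), (x^{(2)} - 1)\right)^\top & \text{else.} \end{cases} \quad (77)$$

The first coordinate of the gradient $x \mapsto \frac{\partial f_1^\eta}{\partial x^{(1)}}(x)$ can
be expressed as the following concatenation of the functions:
$x \mapsto y = (x^{(1)} - 1, x^{(2)} - 1)^\top \mapsto z = (\sqrt{\eta}\, y^{(1)}, y^{(2)}) \mapsto w = \mathrm{Proj}_{B_\chi}(z) \mapsto \frac{\partial f_1^\eta}{\partial x^{(1)}}(x) = \sqrt{\eta}\, w^{(1)}$,
where $\mathrm{Proj}_{B_\chi}(z)$ is the projection of $z$ on the ball centered at zero with radius
$\chi$. All the functions $\phi$ in the concatenation above are Lipschitz continuous with constant
one, and so $x \mapsto \frac{\partial f_1^\eta}{\partial x^{(1)}}(x)$ is also Lipschitz continuous
with constant one. (Given the function $\phi_m(\phi_{m-1}(\dots(\phi_1(x))))$, where the
$\phi_i$'s are Lipschitz continuous with constant one, we have
$\|\phi_m(\dots(\phi_1(u))) - \phi_m(\dots(\phi_1(v)))\| \le \dots \le \|u - v\|$.) Similarly, we
can show that $x \mapsto \frac{\partial f_1^\eta}{\partial x^{(2)}}(x)$ is Lipschitz continuous
with constant one. This implies that the gradient $x \mapsto \nabla f_1^\eta(x)$ is Lipschitz
continuous with constant $\sqrt{2}$. Also,
$\left|\frac{\partial f_1^\eta}{\partial x^{(1)}}(x)\right| \le \sqrt{\eta}\chi \le 6$, for all
$x$. (Recall the concatenation representation
$x \mapsto y = (x^{(1)} - 1, x^{(2)} - 1)^\top \mapsto z = (\sqrt{\eta}\, y^{(1)}, y^{(2)}) \mapsto w = \mathrm{Proj}_{B_\chi}(z) \mapsto \frac{\partial f_1^\eta}{\partial x^{(1)}}(x) = \sqrt{\eta}\, w^{(1)}$;
then, for any $x \in \mathbb{R}^2$,
$\left|\frac{\partial f_1^\eta}{\partial x^{(1)}}(x)\right| \le \sqrt{\eta}\|\mathrm{Proj}_{B_\chi}(z)\|$,
for some $z \in \mathbb{R}^2$, and so
$\left|\frac{\partial f_1^\eta}{\partial x^{(1)}}(x)\right| \le \sqrt{\eta}\chi$.) Similarly,
$\left|\frac{\partial f_1^\eta}{\partial x^{(2)}}(x)\right| \le \chi \le 6$. Thus, for the gradient,
we have: $\|\nabla f_1^\eta(x)\| \le 6\sqrt{2} < 10$, for all $x$. We can analogously show that
$\|\nabla f_2^\eta(x)\| \le 6\sqrt{2} < 10$, for all $x$.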
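
The properties above can be checked numerically. The sketch below implements a Huber-like cost of the form described in (75) and its gradient via the projection representation, then samples random points to confirm the gradient bound and evaluates the optimal value at $x^\star = (0,0)^\top$. The function names, the sampling box, and the sample value of $\eta$ are assumptions made for illustration.

```python
import math
import random

CHI = 6.0  # the constant chi = 6 from the text


def z_vector(x, i, eta):
    """z = (sqrt(eta) * (x1 + (-1)**i), x2 + (-1)**i) for node i in {1, 2}."""
    s = (-1.0) ** i
    return (math.sqrt(eta) * (x[0] + s), x[1] + s)


def f_node(x, i, eta):
    """Huber-like cost: quadratic for ||z|| <= chi, linear growth outside."""
    z = z_vector(x, i, eta)
    nz = math.hypot(z[0], z[1])
    return 0.5 * nz * nz if nz <= CHI else CHI * (nz - CHI / 2.0)


def grad_node(x, i, eta):
    """Gradient (sqrt(eta) * w1, w2), with w the projection of z onto the chi-ball."""
    z = z_vector(x, i, eta)
    nz = math.hypot(z[0], z[1])
    scale = 1.0 if nz <= CHI else CHI / nz
    return (math.sqrt(eta) * scale * z[0], scale * z[1])


random.seed(0)
max_grad_norm = 0.0
for _ in range(10000):
    x = (random.uniform(-50.0, 50.0), random.uniform(-50.0, 50.0))
    eta = random.random()
    for i in (1, 2):
        g = grad_node(x, i, eta)
        max_grad_norm = max(max_grad_norm, math.hypot(g[0], g[1]))

eta0 = 0.3  # hypothetical sample value of eta in [0, 1]
f_at_origin = f_node((0.0, 0.0), 1, eta0) + f_node((0.0, 0.0), 2, eta0)
```

The sampled gradient norms stay below $6\sqrt{2}$, and the sum of the two costs at the origin equals $\eta + 1$, in line with the stated optimal value.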

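The algorithm in [6]. Now, consider the algorithm in [6], and consider $x_i(k)$, the solution
estimate at node $i$ and time $k$. Denote by $x^{l}(k) = (x_1^{(l)}(k), x_2^{(l)}(k))^\top$ the
vector that stacks the $l$-th coordinate of the solution estimate of both nodes, $l = 1,2$; we
write $x^{\mathrm{I}}(k)$ and $x^{\mathrm{II}}(k)$ for $l = 1$ and $l = 2$, respectively. Also,
denote by
$d^{l}(k) = \left(\frac{\partial f_1^\eta(x_1(k))}{\partial x^{(l)}}, \frac{\partial f_2^\eta(x_2(k))}{\partial x^{(l)}}\right)^\top$,
$l = 1,2$. Then, the update rule of [6] is, for the $f_i^\eta$'s in (75), and for $l = 1,2$:

$$x^{l}(k) = W x^{l}(k-1) - \alpha_{k-1} d^{l}(k-1), \quad k = 1,2,... \quad (78)$$

We set the specific initialization $x^{\mathrm{I}}(0) = (1,1)^\top$, and
$x^{\mathrm{II}}(0) = (0,0)^\top$.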

Performance evaluation of [6]. We show that, under the above initialization, $x_i(k)$ belongs
to the region $\mathcal{R}_i$ in (76) (the region where $f_i^\eta$ is quadratic), for all $k$, for
both nodes $i = 1,2$.

We first prove that, if $\|x^{\mathrm{I}}(k)\| \le 2\sqrt{2}$ and
$\|x^{\mathrm{II}}(k)\| \le 2\sqrt{2}$, then $x_i(k) \in \mathcal{R}_i$, $i = 1,2$. Consider
node 1's estimate $x_1(k)$. If $\|x^{\mathrm{I}}(k)\| \le 2\sqrt{2}$ and
$\|x^{\mathrm{II}}(k)\| \le 2\sqrt{2}$, then $|x_1^{(l)}(k)| \le 2\sqrt{2}$, $l = 1,2$, and:

$$\eta(x_1^{(1)}(k) - 1)^2 + (x_1^{(2)}(k) - 1)^2 \le 2(2\sqrt{2} + 1)^2 < 32 < \chi^2 = 36,$$

which means $x_1(k) \in \mathcal{R}_1$. (Analogously, we can show $x_2(k) \in \mathcal{R}_2$.)

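We next prove that $\|x^{l}(k)\| \le 2\sqrt{2}$, $l = 1,2$, for all $k$; we do this by induction.
For $k = 0$, $\|x^{l}(0)\| \le 2\sqrt{2}$, $l = 1,2$. Now, suppose that, for some $k \ge 1$,
$\|x^{l}(k-1)\| \le 2\sqrt{2}$, $l = 1,2$. Then, the update equations (78) to get
$x^{\mathrm{I}}(k)$ and $x^{\mathrm{II}}(k)$, using the derivatives of the $f_i^\eta$'s in the
quadratic region in (77), become:

$$x^{\mathrm{I}}(k) = (W - \alpha_{k-1}\eta I)\, x^{\mathrm{I}}(k-1) - \alpha_{k-1}\eta\, (-1,1)^\top \quad (79)$$

$$x^{\mathrm{II}}(k) = (W - \alpha_{k-1} I)\, x^{\mathrm{II}}(k-1) - \alpha_{k-1}\, (-1,1)^\top. \quad (80)$$

From (79)-(80), using the sub-additive and sub-multiplicative properties of norms, and the
step-size value $\alpha_{k-1} = c/k^\tau$:

$$\|x^{\mathrm{I}}(k)\| \le \left(1 - \frac{c\eta}{k^\tau}\right)\|x^{\mathrm{I}}(k-1)\| + \frac{c\eta}{k^\tau}\sqrt{2} = \|x^{\mathrm{I}}(k-1)\| - \frac{c\eta}{k^\tau}\left(\|x^{\mathrm{I}}(k-1)\| - \sqrt{2}\right).$$

Now, we distinguish two cases: 1) $\|x^{\mathrm{I}}(k-1)\| \in [0, \sqrt{2}]$; and
2) $\|x^{\mathrm{I}}(k-1)\| \in (\sqrt{2}, 2\sqrt{2}]$. In case 1, from the last equation:

$$\|x^{\mathrm{I}}(k)\| \le \|x^{\mathrm{I}}(k-1)\| + \frac{c\eta}{k^\tau}\sqrt{2} \le 2\sqrt{2},$$

where we used $0 \le c \le 1/(2\sqrt{2}) = 1/(2L)$ and $0 \le \eta \le 1$. In case 2:

$$\|x^{\mathrm{I}}(k)\| \le \|x^{\mathrm{I}}(k-1)\| \le 2\sqrt{2}.$$

Thus, we have shown that $\|x^{\mathrm{I}}(k)\| \le 2\sqrt{2}$. Similarly, we can show that
$\|x^{\mathrm{II}}(k)\| \le 2\sqrt{2}$. Thus, by induction, $\|x^{l}(k)\| \le 2\sqrt{2}$,
$l = 1,2$, for all $k$, and so $x_i(k) \in \mathcal{R}_i$, $i = 1,2$, for all $k$.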
We now evaluate the sum of the nodes' optimality gaps $\sum_{i=1}^{2} (f(x_i(k)) - f^\star)$,
where $f(x) = f_1^\eta(x) + f_2^\eta(x)$. Because $x_i(k) \in \mathcal{R}_i$, $i = 1,2$, verify,
using (75), and $f^\star = 1 + \eta$, that:

$$\sum_{i=1}^{2} \left(f(x_i(k)) - f^\star\right) = \eta\|x^{\mathrm{I}}(k)\|^2 + \|x^{\mathrm{II}}(k)\|^2. \quad (81)$$
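
The induction argument and the identity (81) can be illustrated by simulating the two-node recursion directly. The sketch below runs the update $x_i(k) = \sum_j W_{ij} x_j(k-1) - \alpha_{k-1}\nabla f_i^\eta(x_i(k-1))$ with $\alpha_{k-1} = c/k^\tau$ and records the two stacked-coordinate norms and the summed optimality gap. The parameter values ($\eta = 0.3$, $\tau = 0.5$, 200 iterations) and all function names are assumptions chosen for illustration; the cost is the Huber-like form described in (75).

```python
import math

CHI, W_OFF = 6.0, 1.0 / 8.0  # chi = 6 and off-diagonal weight w = 1/8 from the text


def grad_node(x, i, eta):
    """Gradient of the Huber-like cost of node i in {1, 2}."""
    s = (-1.0) ** i
    z = (math.sqrt(eta) * (x[0] + s), x[1] + s)
    nz = math.hypot(z[0], z[1])
    scale = 1.0 if nz <= CHI else CHI / nz
    return (math.sqrt(eta) * scale * z[0], scale * z[1])


def f_sum(x, eta):
    """f(x) = f_1(x) + f_2(x) for the two Huber-like costs."""
    total = 0.0
    for i in (1, 2):
        s = (-1.0) ** i
        nz = math.hypot(math.sqrt(eta) * (x[0] + s), x[1] + s)
        total += 0.5 * nz * nz if nz <= CHI else CHI * (nz - CHI / 2.0)
    return total


def run(eta, c, tau, iters):
    """Simulate the two-node update; return (||x^I||, ||x^II||, summed gap) per iteration."""
    x = [(1.0, 0.0), (1.0, 0.0)]  # first coordinates (1, 1), second coordinates (0, 0)
    history = []
    for k in range(1, iters + 1):
        alpha = c / k ** tau
        g = [grad_node(x[0], 1, eta), grad_node(x[1], 2, eta)]
        mixed = [tuple((1.0 - W_OFF) * x[i][l] + W_OFF * x[1 - i][l] for l in range(2))
                 for i in range(2)]
        x = [tuple(mixed[i][l] - alpha * g[i][l] for l in range(2)) for i in range(2)]
        norm_I = math.hypot(x[0][0], x[1][0])
        norm_II = math.hypot(x[0][1], x[1][1])
        gap = sum(f_sum(xi, eta) - (eta + 1.0) for xi in x)
        history.append((norm_I, norm_II, gap))
    return history


ETA = 0.3  # hypothetical sample value
history = run(eta=ETA, c=1.0 / (2.0 * math.sqrt(2.0)), tau=0.5, iters=200)
```

In this run both norms stay below $2\sqrt{2}$ and the recorded gap matches $\eta\|x^{\mathrm{I}}(k)\|^2 + \|x^{\mathrm{II}}(k)\|^2$ up to rounding.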

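Because $x_i(k) \in \mathcal{R}_i$, for all $k$, the recursions (79)-(80) hold for all
$k = 0, 1, 2, ...$, and we can write the solution for $x^{\mathrm{I}}(k)$ and
$x^{\mathrm{II}}(k)$. By unwinding the recursions in (79)-(80), and using
$x^{\mathrm{I}}(0) = (1,1)^\top$, $x^{\mathrm{II}}(0) = (0,0)^\top$:

$$x^{\mathrm{I}}(k) = (W - \alpha_{k-1}\eta I)(W - \alpha_{k-2}\eta I)\cdots(W - \alpha_0\eta I)(1,1)^\top - \eta\left(\sum_{t=0}^{k-2} (W - \alpha_{k-1}\eta I)(W - \alpha_{k-2}\eta I)\cdots(W - \alpha_{t+1}\eta I)\alpha_t + \alpha_{k-1} I\right)(-1,1)^\top$$

$$x^{\mathrm{II}}(k) = -\left(\sum_{t=0}^{k-2} (W - \alpha_{k-1} I)(W - \alpha_{k-2} I)\cdots(W - \alpha_{t+1} I)\alpha_t + \alpha_{k-1} I\right)(-1,1)^\top.$$

We simplify the above using the eigenvalue decomposition $W = Q\Lambda Q^\top$. The matrix
$W - \alpha_{k-1}\eta I$ decomposes as $W - \alpha_{k-1}\eta I = Q(\Lambda - \alpha_{k-1}\eta I)Q^\top$,
and, similarly, $W - \alpha_{k-1} I = Q(\Lambda - \alpha_{k-1} I)Q^\top$. Then, the products
$(W - \alpha_{k-1}\eta I)(W - \alpha_{k-2}\eta I)\cdots(W - \alpha_{t+1}\eta I) = Q(\Lambda - \alpha_{k-1}\eta I)\cdots(\Lambda - \alpha_{t+1}\eta I)Q^\top$,
and $(W - \alpha_{k-1} I)\cdots(W - \alpha_{t+1} I) = Q(\Lambda - \alpha_{k-1} I)\cdots(\Lambda - \alpha_{t+1} I)Q^\top$.
Using these decompositions, and the orthogonality: $q_1^\top (1,1)^\top = 0$, and
$q_2^\top (-1,1)^\top = 0$, we obtain:

$$x^{\mathrm{I}}(k) = (1 - \alpha_{k-1}\eta)(1 - \alpha_{k-2}\eta)\cdots(1 - \alpha_0\eta)(1,1)^\top - \eta\left(\sum_{t=0}^{k-2} (\lambda_1 - \alpha_{k-1}\eta)(\lambda_1 - \alpha_{k-2}\eta)\cdots(\lambda_1 - \alpha_{t+1}\eta)\alpha_t + \alpha_{k-1}\right)(-1,1)^\top \quad (82)$$

$$x^{\mathrm{II}}(k) = -\left(\sum_{t=0}^{k-2} (\lambda_1 - \alpha_{k-1})(\lambda_1 - \alpha_{k-2})\cdots(\lambda_1 - \alpha_{t+1})\alpha_t + \alpha_{k-1}\right)(-1,1)^\top. \quad (83)$$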
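
The unwinding and diagonalization can be cross-checked numerically. The sketch below iterates the matrix recursion (79) from $(1,1)^\top$ and compares the result with a scalar closed form in which $(1,1)^\top$ is propagated with eigenvalue 1 and $(-1,1)^\top$ with eigenvalue $\lambda_1 = 3/4$. The signs follow from (79) as written here; the function names and the sample parameters are assumptions made for illustration.

```python
import math


def recursion_xI(k, eta, c, tau, w=1.0 / 8.0):
    """Iterate x <- (W - a*eta*I) x - a*eta*(-1, 1), a = c/(t+1)**tau, from x(0) = (1, 1)."""
    x = (1.0, 1.0)
    for t in range(k):
        a = c / (t + 1) ** tau
        Wx = ((1.0 - w) * x[0] + w * x[1], w * x[0] + (1.0 - w) * x[1])
        x = (Wx[0] - a * eta * x[0] + a * eta, Wx[1] - a * eta * x[1] - a * eta)
    return x


def closed_form_xI(k, eta, c, tau, lam1=3.0 / 4.0):
    """Evaluate P*(1, 1) - eta*S*(-1, 1) using scalar products along the two eigenvectors."""
    alpha = [c / (t + 1) ** tau for t in range(k)]
    P = 1.0
    for a in alpha:
        P *= 1.0 - a * eta
    S = 0.0
    for t in range(k):
        prod = 1.0
        for s in range(t + 1, k):
            prod *= lam1 - alpha[s] * eta
        S += prod * alpha[t]
    return (P + eta * S, P - eta * S)


# Hypothetical sample parameters within the ranges used in the text.
params = dict(eta=0.4, c=1.0 / (2.0 * math.sqrt(2.0)), tau=0.5)
x_rec = recursion_xI(25, **params)
x_cf = closed_form_xI(25, **params)
```

Agreement of the two vectors confirms that the $(1,1)^\top$ and $(-1,1)^\top$ components evolve independently, which is what makes the scalar expressions (82)-(83) possible.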

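We next lower bound the norm of $x^{\mathrm{I}}(k)$. Note that
$\lambda_1 - \alpha_{k-1}\eta = 3/4 - \frac{c\eta}{k^\tau} \ge 1/4$, for all $k, \tau, c$.
Also, $\lambda_1 - \alpha_{k-1}\eta \le \lambda_1 = 3/4$, for all $k, \tau, c$. Similarly, we can
show $1 - \alpha_{k-1}\eta \in [1/2, 1]$. (Note that the terms
$(1 - \alpha_{k-1}\eta)\cdots(1 - \alpha_0\eta)$ and
$(\lambda_1 - \alpha_{k-1}\eta)\cdots(\lambda_1 - \alpha_{t+1}\eta)$ in (82) are then
nonnegative, $\forall t$.) Now, we can lower bound the norm of $x^{\mathrm{I}}(k)$ as:

$$\|x^{\mathrm{I}}(k)\| \ge (1 - \alpha_{k-1}\eta)(1 - \alpha_{k-2}\eta)\cdots(1 - \alpha_0\eta) - \eta c \sum_{t=0}^{k-1} \lambda_1^{k-t-1}\frac{1}{(t+1)^\tau}.$$

Further, using the inequality $(1 - a_1)(1 - a_2)\cdots(1 - a_n) \ge 1 - (a_1 + a_2 + ... + a_n)$,
$a_i \in [0,1)$, $\forall i$, and $\alpha_k = \frac{c}{(k+1)^\tau}$:

$$\|x^{\mathrm{I}}(k)\| \ge \left(1 - \eta c \sum_{t=0}^{k-1} \frac{1}{(t+1)^\tau}\right) - \eta c\, \frac{C(\lambda_1)}{k^\tau}, \quad (84)$$

where we used $\sum_{t=0}^{k-1} \lambda_1^{k-t-1}\frac{1}{(t+1)^\tau} \le \frac{C(\lambda_1)}{k^\tau}$,
with $C(\lambda_1) \in (0,\infty)$ depending only on $\lambda_1$, and not depending on $\tau$.
(The latter can be proved similarly to the proof of Lemma 4.) Further, we assume from now on
that $\eta \in [0,1]$ and $k = 1,2,...$ are such that:

$$1 - \eta c \sum_{t=0}^{k-1} \frac{1}{(t+1)^\tau} - \eta c\, \frac{C(\lambda_1)}{k^\tau} \ge 0. \quad (85)$$

(Note that such a choice of $\eta$, $k$ exists; e.g., $\eta = 0$, and $k$ is arbitrary.) Then,
we can square (84); using the inequality $(a - b)^2 \ge a^2 - 2ab$, $a, b \ge 0$, we obtain: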



Finally, using the inequality k^ ^ — 1 < X]t=o^ {t+iy — ^' multiplying the last equation 
by 77: 

V\\x\k)\\' > rj{l-2nck'-^)-2n'cCi\^){l-ricik'-^-l))^ 

= 77 (1 - 2rick^-^) - 2ifcC{\^) (1 - r^cA;^-^) ^ - 2773c2C(Ai)^. 

Now, we set rj = r]{k,c,T) = ^^^i-t ■ Then, plugging ri{k,c,T) in the above equation: 

vWxHkW > — h l^C(Xi)^ l^C(Xi) 



8ck^-^ 8cP-^ ' M 32cA;3-2^ 
1 



^ / _ 3C(Ai) _ C(Ai) \ 
The relation (86) is valid if, only if, the condition (85) holds; we now check (85) for 77 = 43^1=7 



k-1 



1 - rye J] ^^^j - 77cC(AO^ > (1 - 77cA;^--) - ricC{X,)^^ = 3/4 - C(A0/(4A;), 



and so (86) holds for all k > C(Ai)/3. 
We continue by lower bounding the norm of $x^{\mathrm{II}}(k)$. From (83):



\x'\k)\\>a,. 



c 



and so: 



\\^'\k)r>^- (87) 
Finally, combining (86) and (87), and using (81): 

ifix,{k)) -n>li2 imik)) - n > (i - c(ao^) 

i=l ^ ' 

We have just proved that, for all k > C(Ai)/3: 

Sik, R^V2;t,c)> (l - C(AO^) + ^ 0' ^ [0' 1/2^]- 

Further, for any A; > 2C(Ai): 

£:(^, = v^; T, c) > + Vr > 0, Vc e [0, 1/2^2]. 

Finally, taking the infimum over r > 0, c e [0,1/2^2]: 

inf ^(^,^= \/2;r,c) > — Jr^-^ = Q (-^ ) . 

r>0,c6[0,l/(2L)] 2(32)2/3 A;2/3 \k^/^ J 

The result above shows that there exists a class $\mathcal{F}$, namely, the
$\mathcal{F}(L = \sqrt{2}, G = 10)$, such that the algorithm in [6] cannot achieve a worst-case
error $\mathcal{E}(k, R = \sqrt{2}; \tau, c)$ better than $O\left(1/k^{2/3}\right)$, for any of
the parameter choices $\tau \ge 0$, $c \in [0, 1/(2L)]$. In contrast, our algorithm D-NG
achieves the worst-case error $\mathcal{E}(k, R = \sqrt{2}) = O(\log k/k)$ for any class
$\mathcal{F}(L, G)$.

Demonstration by simulation. We demonstrate by simulation the obtained lower bound on 
the worst-case optimality gap with [6]. Figure 1 plots the estimated decay rate C,{t) of £{k, R = 
~ fecFT 'versus r, for the constant c — 1/ (2^/2). All the remaining parameters are set as 
in the theoretical analysis above. For a fixed k and r, we estimate S{k^ R = \/2; r, c) as S{k, R = 
V2; T, c) = maxj=i_2{/''(xi(/i;)) - /*}, where rj ^ 1/ {Ack^'''), and msiKi=i^2{P{xi{k)) - /*} is 
the optimality gap with the algorithm in [6] after k iterations. ( The initialization is as explained 
above in the theoretical analysis, and with the step size at — c/{t + iy, t — 1, A;.) For a fixed 
$\tau$, we estimate the slope $\zeta(\tau)$ as:

$$\zeta(\tau) = \frac{\log_{10}\left(\mathcal{E}(k_2, R = \sqrt{2}; \tau, c)\right) - \log_{10}\left(\mathcal{E}(k_1, R = \sqrt{2}; \tau, c)\right)}{\log_{10}(k_1) - \log_{10}(k_2)}, \quad (88)$$

where $k_1 = 900$ and $k_2 = 1000$. Figure 1 shows (modulo a possible small offset due to the
imperfect estimation in (88) of the asymptotic slope) that, with [6],
$\mathcal{E}(k, R = \sqrt{2}; \tau, c)$ indeed cannot be better than $\Omega\left(1/k^{2/3}\right)$.
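
The slope estimate in (88) amounts to a two-point fit on a log-log scale. A minimal Python sketch is given below; the function name is hypothetical, and the synthetic error curve $5/k^{2/3}$ is made-up test data used only to show that the estimator returns the decay exponent, not a result from the paper.

```python
import math


def estimated_decay_rate(e_k1, e_k2, k1=900, k2=1000):
    """Two-point log-log slope, signed so that an error decaying like 1/k**zeta returns zeta."""
    return (math.log10(e_k2) - math.log10(e_k1)) / (math.log10(k1) - math.log10(k2))


def synthetic_error(k):
    """Hypothetical error curve 5 / k**(2/3), used only to exercise the estimator."""
    return 5.0 / k ** (2.0 / 3.0)


zeta = estimated_decay_rate(synthetic_error(900), synthetic_error(1000))
```

Because only two nearby iteration counts are used, the estimate reflects the local slope and can carry a small offset from the asymptotic rate when lower-order terms are still visible.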
Fig. 3. Estimated decay rate $\zeta(\tau)$ of $\mathcal{E}(k, R = \sqrt{2}; \tau, c)$ versus
$\tau$, for the algorithm in [6] with step-size $c/(k+1)^\tau$, $c = 1/(2\sqrt{2})$.



Remark. We provide here an intuitive explanation why D-NG achieves a faster rate under
Assumptions 1-3 than the method in [6]. With [6], it can be shown that the global average
$\overline{x}(k) := \frac{1}{N}\sum_{i=1}^N x_i(k)$ of the nodes' estimates evolves according to
a (centralized) ordinary gradient method under inexact oracle:
$\overline{x}(k) = \overline{x}(k-1) - \frac{\alpha_{k-1}}{N}\sum_{i=1}^N \nabla f_i(x_i(k-1))$.
The "inexactness" constant is $\delta_k = L\sum_{i=1}^N \|\overline{x}(k) - x_i(k)\|^2$. If [6]
uses the step-size $\alpha_k = c/(k+1)^\tau$, $k = 0, 1, ...$, $\tau \in [0,1]$, it can be shown
that $\delta_k = O\left(1/k^{2\tau}\right)$, and hence, the faster the step-size decays (the
larger $\tau$), the better (the smaller $\delta_k$). On the other hand, the "optimization"
process becomes slower when $\tau$ increases: the ordinary gradient's optimality gap is
$O\left(1/k^{1-\tau}\right)$ (assuming $\delta_k = 0$). Thus, there is a tradeoff between the
disagreement (or $\delta_k$) and the "optimization process." It turns out that this tradeoff
gives the optimal $\tau$ to be strictly positive, so that the algorithm in [6] achieves
$O\left(1/k^{2/3}\right)$, hence strictly worse than D-NG's rate $O(\log k/k)$.
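
The tradeoff in this remark can be made concrete with a small search. The sketch below assumes, as a simplification that is not stated explicitly in the text, that the disagreement term decays like $1/k^{2\tau}$ and the optimization term like $1/k^{1-\tau}$, so that the overall decay exponent is $\min\{2\tau, 1-\tau\}$; these exponents are chosen to be consistent with the $1/k^{2/3}$ rate quoted for [6]. The function name and grid size are hypothetical.

```python
def best_tradeoff(grid_size=3000):
    """Maximise the assumed overall exponent min(2*tau, 1 - tau) over tau in [0, 1]."""
    best_tau, best_rate = 0.0, 0.0
    for n in range(grid_size + 1):
        tau = n / grid_size
        rate = min(2.0 * tau, 1.0 - tau)  # slower of the two assumed decay exponents
        if rate > best_rate:
            best_tau, best_rate = tau, rate
    return best_tau, best_rate


tau_star, rate_star = best_tradeoff()
```

Under these assumed exponents the balance point is $\tau = 1/3$ with overall exponent $2/3$, which matches the strictly positive optimal $\tau$ described in the remark.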